goodai-ltm-benchmark
A library for benchmarking the Long-Term Memory and continual-learning capabilities of LLM-based agents, with all the tests and code you need to evaluate your own agents. See more in the blogpost.
This repository contains code and data for replicating experiments on Long-Term Memory (LTM) abilities of conversational agents. It includes a benchmark for testing agents' memory performance over long conversations, evaluating tasks requiring dynamic memory upkeep and information integration. The repository supports various models, datasets, and configurations for benchmarking and reporting results.
README:
This repository contains the code and data to replicate our experiments regarding the Long-Term Memory (LTM) abilities of conversational agents. The benchmark was originally released together with a blogpost; check it out for more information about the benchmark and the related research goals.
As part of our research efforts in the area of continual learning, we are open-sourcing this benchmark for testing agents' ability to perform tasks that involve the advanced use of memory over very long conversations. Among other things, we evaluate the agents' performance on tasks that require dynamic upkeep of memories or the integration of information over long periods of time.
We are open-sourcing:
- The living GoodAI LTM Benchmark (this repository).
- Our LTM agents.
- Our experiment data and results.
These tests require Python 3.10 or higher.
First, set your OPENAI_API_KEY (and optionally ANTHROPIC_API_KEY) environment variable and clone the repository:
git clone git@github.com:GoodAI/goodai-ltm-benchmark.git

The file run_benchmark.py can be executed by giving it a configuration .yml file using -c (examples are located in ./configurations/), an agent using -a (see below), and optionally a limit for the size of the context with -m.
For example, to run a benchmark on the GPT-4-turbo LLM with a context size of 4096 tokens:
python run_benchmark.py -c ./configurations/published_benchmarks/<configuration_name>.yml \
       -a gpt-4-turbo -m 4096

This will generate a set of test specifications if there is not one already, and start to produce result files, one for each test. The result files will be located at ./tests/<benchmark_name>/results/<agent_name>/.
At the end of testing, an HTML report will be generated in data/reports which will give a detailed breakdown of the tests run, responses, and evaluations. It will be given a name of the form <time stamp> - Detailed Report - <run_name> - <agent_name>.html.
The agents that have been specifically implemented in this repository are shown below. To implement your own agent, please see the more detailed instructions here.
# OpenAI models
gpt-3.5-turbo # GPT3.5
gpt-4-turbo # latest GPT4-turbo
gpt-4o # latest GPT4o
# Anthropic Models (200k context)
claude-2.1 # Claude 2.1
claude-3-haiku # Claude 3 Haiku
claude-3-sonnet # Claude 3 Sonnet
claude-3-opus # Claude 3 Opus
# Google Gemini (1.5M-2M context)
gemini # Gemini 1.5 Pro
# Models with timestamped messages
ts-<model> # Any of the above OpenAI or Anthropic models
# GoodAI LTM models
# Variants:
# 1. semantic retrieval + query generation + JSON scratchpad
# 2. semantic retrieval
# 3. semantic retrieval + text scratchpad
# Optional model ID to use as core LLM
# Example: ltm_agent_1(claude-3-opus-20240229)
ltm_agent_<variant>[(<model>)]
# Memgpt
memgpt # An actively managed LTM/RAG conversational agent
# Cost Estimation
cost(<cost_in_tokens>,<cost_out_tokens>) # Model for estimating the cost of a benchmark based on the input and output costs (see the sketch below)
# Human models
human # A CLI interface for a human to use the tests.
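The cost(...) agent is useful for gauging what a run would cost before committing to it. As a rough sketch of the underlying arithmetic, this is an assumption on our part rather than the repository's actual accounting, and the per-token prices below are made up:

```python
# Illustration only: a back-of-the-envelope estimate in the spirit of the
# cost(<cost_in_tokens>,<cost_out_tokens>) agent. The prices and token counts
# below are assumptions, not values taken from the benchmark.
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_token: float, price_out_per_token: float) -> float:
    """Return an estimated benchmark cost in dollars."""
    return input_tokens * price_in_per_token + output_tokens * price_out_per_token


# Example: 2M input tokens and 50k output tokens at hypothetical per-token prices.
print(estimate_cost(2_000_000, 50_000, 10 / 1_000_000, 30 / 1_000_000))
```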
We also support any LLM that is supported by litellm. To use external providers through litellm (e.g. together.ai), set your API key either in a .env file or as an environment variable:
export TOGETHERAI_API_KEY=sk-...

Then call your agent in the form <api>/<author>/<model>. For example:
python run_benchmark.py -c ./configurations/published_benchmarks/benchmark-v3-500k.yml \
       -a together_ai/meta-llama/Llama-3-70b-chat-hf -m 8000

The configuration files used in the different versions of the benchmark can be found in configurations/published_benchmarks, where <x>k denotes the memory span in thousands of tokens. For each of the benchmarks under a single version, we keep the scripts and needles the same, but increase the number of filler tokens to match the larger memory span. Older configurations from previous releases can be found in published_benchmarks/legacy. These configuration files are compatible only with their corresponding releases, and their operation is described in the readmes of those releases.
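Returning to the litellm routing above: the benchmark presumably hands the <api>/<author>/<model> string to litellm. As a standalone sketch of what that naming scheme resolves to (not the benchmark's internal code), the same model can be called directly:

```python
# Standalone sketch: calling the same together.ai-hosted model directly via litellm,
# using the <api>/<author>/<model> naming convention described above.
# Assumes TOGETHERAI_API_KEY is already set in the environment or a .env file.
import litellm

response = litellm.completion(
    model="together_ai/meta-llama/Llama-3-70b-chat-hf",
    messages=[{"role": "user", "content": "Hello! Please remember that my car is blue."}],
)
print(response.choices[0].message.content)
```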
The datasets that are implemented for this benchmark can be found in ./datasets/. Briefly, they are:
- chapterbreak
- colours
- jokes
- locations_directions
- name_list
- prospective_memory
- restaurant
- sallyanne
- shopping
- spy_meeting
- trigger_response
More details about each test can be found in the description inside its individual file.
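As a purely hypothetical illustration of the kind of test these generators produce (the real generators live in ./datasets/ and define their own scripts, filler, and scoring; nothing below is taken from them), a colours-style test might pair a randomly generated "needle" statement with a question asked much later in the conversation:

```python
# Hypothetical sketch of a "random combination" test generator, for illustration only.
# The actual datasets in ./datasets/ define their own scripts, needles, and evaluation.
import random

COLOURS = ["red", "green", "blue", "yellow"]
OBJECTS = ["car", "notebook", "umbrella", "mug"]

def generate_test(seed: int) -> dict:
    """Return a needle statement, a question to ask later, and the expected answer."""
    rng = random.Random(seed)
    colour, obj = rng.choice(COLOURS), rng.choice(OBJECTS)
    return {
        "script": [f"My {obj} is {colour}. Please remember that."],
        "question": f"What colour is my {obj}?",
        "expected_answer": colour,
    }

print(generate_test(42))
```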
The repository consists of four parts:
- Datasets: These are test generators, which produce tests either through random combination of words, phrases, and numbers, by sampling lines from an existing dataset, or by generating them via a prompted GPT.
- Models: A model is an agent that can be set to perform the tasks posed by the datasets. This part presents a very simple interface and makes it easy to integrate your own agents with the benchmark (see the sketch after this list).
- Runner: This script takes a configuration and model specification, optionally generates the set of test instances, and executes the benchmark.
- Reports: These files generate the reports as self-contained HTML files, with support for individual and comparative reporting.
More details for each of these parts can be found here: datasets, models, runner, reports.
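To give a feel for what that "very simple interface" looks like in practice, here is a minimal sketch of a custom agent. The class shape, method names, and wiring are assumptions for illustration only; the repository's model documentation linked above describes the real integration points.

```python
# Hypothetical sketch of a custom agent, for illustration only. The real interface
# and registration mechanism are described in the repository's model documentation;
# the class and method names here are assumptions.
class EchoAgent:
    """A trivial agent that replies by echoing the last user message."""

    def __init__(self, max_prompt_size: int = 4096):
        self.max_prompt_size = max_prompt_size  # analogous to the -m context limit
        self.history: list[str] = []

    def reply(self, user_message: str) -> str:
        # A real agent would call an LLM here and manage its own long-term memory.
        self.history.append(user_message)
        return f"You said: {user_message}"

    def reset(self) -> None:
        # Clear any state accumulated between benchmark runs.
        self.history.clear()
```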
| Model | Context Tokens | Score / 11 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 5 | 10.25 | 0.15 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 4.9 | 11 | 0.61 |
| Llama 3 70B Instruct | 8000 | 8.2 | 8.8 | 0.13 |
| GPT-3.5-turbo | 16384 | 4.1 | 6 | 0.13 |
| GPT-4 Turbo | 128000 | 7.9 | 18.5 | 6.94 |
| GPT-4o | 128000 | 7.6 | 8 | 3.08 |
| Claude 3 Opus | 200000 | 8.3 | 41 | 15.28 |
| Gemini 1.5 Pro | 2000000 | 7.4 | 58 | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 8.4 | 26 | 0.65 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 9.2 | 68.3 | 9.81 |
| LTMAgent 1 (Claude) | 16384 | 8.7 | 99.5 | 0.52 |
| Model | Context Tokens | Score / 10 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 1.4 | 7.5 | 0.08 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 5.6 | 97.2 | 0.93 |
| Llama 3 70B Instruct | 8000 | 1.9 | 4.5 | 0.08 |
| GPT-3.5-turbo | 16384 | 4.7 | 8.1 | 0.31 |
| GPT-4 Turbo | 128000 | 6.6 | 5.5 | 8.29 |
| GPT-4o | 128000 | 5.9 | 4.8 | 4.55 |
| Claude 3 Opus | 200000 | 7.8 | 41.8 | 19.19 |
| Gemini 1.5 Pro | 2000000 | 6.5 | 55 | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 6.9 | 22.9 | 1.2 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 6.3 | 99 | 17.34 |
| LTMAgent 1 (Claude) | 16384 | 7.5 | 90.8 | 0.38 |
| Model | Context Tokens | Score / 11 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 0.1 | 9 | 0.06 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 0.0 | 18 | 0.93 |
| Llama 3 70B Instruct | 8000 | 0.2 | 10.8 | 0.06 |
| GPT-3.5-turbo | 16384 | 0.1 | 5.5 | 0.06 |
| GPT-4 Turbo | 128000 | 4.8 | 18.5 | 77.74 |
| GPT-4o | 128000 | 4.6 | 15 | 38.38 |
| Claude 3 Opus | 200000 | 6.7 | 133.5 | 215.42 |
| Gemini 1.5 Pro | 2000000 | 6.4 | 39 | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 5 | 43.7 | 2.50 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 5.2 | 171.9 | 61.46 |
| LTMAgent 1 (Claude) | 16384 | 5 | 173.2 | 0.68 |
| Model | Context Tokens | Score / 11 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 0.1 | 7.7 | 0.06 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 0.1 | 21.1 | 1.12 |
| Llama 3 70B Instruct | 8000 | 0.2 | 9.4 | 0.06 |
| GPT-3.5-turbo | 16384 | 0.0 | 6 | 1.33 |
| GPT-4 Turbo | 128000 | 5.8 | 49 | 215.86 |
| GPT-4o | 128000 | 5.5 | 32 | 108.22 |
| Claude 3 Opus | 200000 | 7.4 | 519 | 476.68 |
| Gemini 1.5 Pro | 2000000 | 7.0 | --- | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 4.7 | 86.5 | 3.10 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 5.0 | 567.5 | 89.36 |
| LTMAgent 1 (Claude) | 16384 | 5.7 | 307.5 | 158.24 |
| Model | Context Tokens | Score / 11 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 0.1 | 8.7 | 0.04 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 0.1 | 14.5 | 1.21 |
| Llama 3 70B Instruct | 8000 | 0.2 | 8.0 | 0.06 |
| GPT-3.5-turbo | 16384 | 0.0 | 5.0 | 0.06 |
| GPT-4 Turbo | 128000 | 3.9 | 45.17 | 222.62 |
| GPT-4o | 128000 | 5.2 | 35.75 | 111.80 |
| Claude 3 Opus | 200000 | 5.4 | 338.43 | 502.28 |
| Gemini 1.5 Pro | 2000000 | 8.0 | 76 | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 5.6 | 126.87 | 3.89 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 5.3 | 326.22 | 87.78 |
| LTMAgent 1 (Claude) | 16384 | 6.4 | 342.83 | 149.53 |
| Model | Context Tokens | Score / 11 | Time (m) | Cost ($) |
|---|---|---|---|---|
| Mixtral-8x7B Instruct 0.1 | 32768 | 0.1 | 8.5 | 0.07 |
| Mixtral-8x22B Instruct 0.1 | 65536 | 0.1 | 44 | 1.15 |
| Llama 3 70B Instruct | 8000 | 0.2 | 11.5 | 0.06 |
| GPT-3.5-turbo | 16384 | 0.0 | 6.5 | 0.06 |
| GPT-4 Turbo | 128000 | 1.0 | 48 | 223.16 |
| GPT-4o | 128000 | 0.9 | 38 | 111.49 |
| Claude 3 Opus | 200000 | 3.4 | 324.35 | 527.86 |
| Gemini 1.5 Pro | 2000000 | 5.3 | 82.5 | --- |
| LTMAgent 1 (Llama 3 70B) | 8000 | 4.8 | 250.23 | 6.13 |
| LTMAgent 1 (GPT-4-turbo) | 16384 | 3.1 | 1240.30 | 174.93 |
| LTMAgent 1 (Claude) | 16384 | 4.9 | 528.37 | 230.27 |
- Benchmark 1 (02/2024)
- Benchmark 2 (03/2024)
- Benchmark 3 (04/2024)
This project is licensed under the MIT License - see the LICENSE file for details. Use of this software requires attribution to the original author and project, as detailed in the license.
Some datasets use data generated by GPT, so those specific tests are unsuitable for commercial purposes.
- The filler is drawn from the TriviaQA dataset, which is licensed under Apache 2.0.
- The data for the SallyAnne dataset (labelled data/tomi_data/) was generated using this code, which implements the paper Evaluating Theory of Mind in Question Answering and is currently (as of 22/01/2024) unlicensed.
- The ChapterBreak dataset is described in the paper ChapterBreak: A Challenge Dataset for Long-Range Language Models and the repository can be found on GitHub. ChapterBreak is licensed under Apache 2.0.
- "The Complete Works of William Shakespeare" is public domain. This particular copy has been sourced from Project Gutenburg, whose terms of use can be found on their website.