yet-another-applied-llm-benchmark
A benchmark to evaluate language models on questions I've previously asked them to solve.
Yet Another Applied LLM Benchmark is a collection of diverse tests designed to evaluate the capabilities of language models in performing real-world tasks. The benchmark includes tests such as converting code, decompiling bytecode, explaining minified JavaScript, identifying encoding formats, writing parsers, and generating SQL queries. It features a dataflow domain-specific language for easily adding new tests and has nearly 100 tests based on actual scenarios encountered when working with language models. The benchmark aims to assess whether models can effectively handle tasks that users genuinely care about.
README:
This is a benchmark I made, for me, to test how well language models perform on tasks I care about. I know I care about them because each test is directly derived from something I've asked an LLM to do for me in the last year.
For example, there are tests in this benchmark that evaluate the ability of a model to:
- convert a python function to an equivalent-but-faster c function;
- decompile python bytecode to functional source code;
- explain the functionality of minified javascript;
- identify the encoding format (in this case, uuencoded) of some data;
- write a parser from a BNF-like grammar;
- convert some english sentences to SQL queries; or,
- write some bash oneliners
There are two defining features of this benchmark that make it interesting:
- I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities.
- Directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants.
For example, here's a complete test to evaluate if a model can write a "hello world" program.
'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
I first ask the model to write hello world, then run that program (in a docker container, seamlessly behind the scenes), and then check if stdout contains "hello world". (You should read the >> operator as "and then do". So "a >> b" means "do a, and then do b".)
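The repository implements this DSL in plain Python, with each stage overloading the >> operator. As a rough sketch of how such chaining can work (this is illustrative only, not the project's actual implementation; the toy LLM and the use of exec() here stand in for the real model call and the real in-container execution):

```python
# Toy sketch of a ">>" dataflow pipeline. Class names mirror the README's
# examples, but the bodies are simplified stand-ins, not the benchmark's code.
class Node:
    def __rshift__(self, other):             # node >> node
        return Chain(self, other)
    def __rrshift__(self, other):            # "prompt string" >> node
        return Chain(Constant(other), self)
    def run(self, value):
        raise NotImplementedError

class Constant(Node):
    def __init__(self, value): self.value = value
    def run(self, _): return self.value

class Chain(Node):
    def __init__(self, first, second): self.first, self.second = first, second
    def run(self, value): return self.second.run(self.first.run(value))

class LLMRun(Node):                          # stand-in: always answers with hello world
    def run(self, prompt): return 'print("hello world")'

class PythonRun(Node):                       # stand-in: exec() locally instead of in a container
    def run(self, code):
        import io, contextlib
        out = io.StringIO()
        with contextlib.redirect_stdout(out):
            exec(code, {})
        return out.getvalue()

class SubstringEvaluator(Node):
    def __init__(self, substring): self.substring = substring
    def run(self, text): return self.substring in text

pipeline = 'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
print(pipeline.run(None))                    # True
```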
More interestingly, here's a test case that first asks a model to produce code that draws a flag. To evaluate it, I run the code, partially judge it with another language model, and then fully evaluate the output by comparing to a reference solution.
"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \
LLMRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | SubstringEvaluator("USA") | SubstringEvaluator("America"))
This DSL makes it easy for me to evaluate significantly more diverse and more sophisticated behavior than any other evaluation benchmark I'm aware of. This is helpful for determining whether or not models are capable of performing tasks I actually care about.
I've evaluated a few models on this benchmark. Here's how they perform:
- Claude 3.5 Sonnet: 48% passed
- GPT 4o: 47% passed
- Claude 3 Opus: 42% passed
- Claude 3 Sonnet: 32% passed
- Gemini 1.5 Pro: 32% passed
- Mistral Large: 28% passed
- GPT 3.5: 26% passed
- Mistral Medium: 23% passed
- Gemini 1.0 Pro: 17% passed
A complete evaluation grid is available here.
What this is not: a serious academic benchmark.
In more words: this is not meant to try to rigorously evaluate the capabilities of models on any particular task. It's not meant to be something you can use to decide which model is more capable, more knowledgeable, more factual, less biased, less harmful, more aligned, more helpful, or anything else.
The questions are not optimally prompt-engineered. It is entirely possible---and indeed likely!---that a better phrasing of some of the questions would allow the model to give a better answer.
But I am lazy.
I do not want to remind the model it is AN EXPERT IN PYTHON and tell it that I'll give it a $100,000 tip for giving the right answer OR I WILL MURDER A KITTEN but please pause....take a deep breath....and think step by step by step before answering. (Or whatever the current incantation is people use to get models to work best.)
I just want to type my question and get the right answer. So this benchmark tests for that, on types of questions I've actually cared about having answered.
As a result of my (often intentional) lack of prompt engineering, when a model fails a question, you won't learn very much. Maybe my question was just poorly worded. Maybe it was ambiguous in some way.
Instead, these tests are designed so that I learn something when the model passes. You don't luck your way into correctly compiling Rust programs without having some skill at the language. But you might luck your way into failing by naming the function something I didn't expect and so your correct code just is never invoked.
Again, it's just a collection of questions I've actually asked language models to solve for me to help with various programming tasks, interspersed with a few questions I've asked language models just for fun.
The questions are, for the most part, unmodified questions as I typed them. This means they may not be the most clearly worded (e.g., "In python what __thing__ do I use for ~, kind of like how __add__ is for +", with the answer I'm expecting being __inv__).
Other questions are "unfair" because they require recent knowledge
(e.g., "what is the hidden dimension of llama-2 70b?").
But I care if a model can answer these correctly for me.
Getting this benchmark up and running is fairly straightforward.
On the python side you'll just need to run
pip install -r requirements.txt
to install the python dependencies.
If you want to run it and evaluate a wide range of models you'll also need
pip install -r requirements-extra.txt
to install the dependencies for those other models.
I want to run things in a container to keep them basically safe. Docker is nicer and has slightly better security controls (and so you can use that below if you want), but on linux you need to be root, or give your user almost-root permissions, to start new docker jobs. This scares me a bit.
So I prefer to use podman. Installation is fairly system dependent, so you'll have to look elsewhere to find out how to install it for your system.
The test cases in this benchmark are evaluated by directly executing code that comes out of a language model. Some tests ask the model to rename files, move files around, or make other state-changing operations to your machine.
While I don't think these models have it out for us and will emit rm -rf / out of malice or spite, it's entirely possible (and even likely!) that they'll produce buggy code that will just accidentally trash your computer.
So, to safeguard against this, all LLM output is evaluated from within a
temporary docker container that gets deleted immediately after the test is complete.
(There's also another reason, though: some of the tests assume a fresh install of Ubuntu with particular dependencies in various places. These tests might behave differently on your local machine than they do from within the docker VM.)
If you like to live dangerously (VERY MUCH NOT RECOMMENDED) then there is a flag in the code, I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE, that you can set to True; this will just eval() everything that comes out of the LLMs on your machine directly.
Once you've installed everything, there are a few setup steps before you can run the benchmark.
You should add API keys for any model you want to evaluate. The keys are stored in the config.json file; you can find a template at config.json.example.
Whatever model you are testing, you will also need an OpenAI API key, since an OpenAI model is the default evaluator. This is because a few of the questions require evaluation by a second language model to judge correctness. These secondary evaluations are as simple as possible, but using a high-quality model here is helpful to ensure consistency in the results.
I have had good success using gpt-4-turbo as the evaluation model, but you can configure any model that you want as the evaluator. In my experiments, I had almost identical results with the (cheaper) gpt-3.5-turbo, but in a few cases having the more capable evaluation model gives more reliable results.
To start you'll need to create the docker container where the tests will run. This will first require that you install docker on your machine. Once you've done that, you can then build the image:
docker build -t llm-benchmark-image . # if you're using docker
podman build -t llm-benchmark-image . # if you're using podman
A few test cases require Selenium and Chrome to test if models can generate valid html/javascript programs. Installing the requirements file should install selenium for you, but you'll also need to make sure you install chrome. If you're on ubuntu then you can just run
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
Once you've set up your environment, you can run the entire benchmark in just one line:
python main.py --model gpt-3.5-turbo --run-tests --generate-report
This command will run every single test that's configured, on one model. It will therefore take some time, and will also cost you a few dollars in language model queries. Afterwards, you can view the full result html file in the directory evaluation_examples.
It will also save a cache of this run, so that the next time you can run a new model and view the two results side-by-side. These are saved by default in the directory results/[current git commit hash]/[model name].
If you want to run individual test cases, you can do that too, in two ways. One is to just directly run the test:
PYTHONPATH='.' python tests/print_hello.py
(You can also explore the run_a_simple_testcase.ipynb notebook to quickly run a sample test case on Colab.)
The other, if you want to save the result of this run so you can load it later, is to run the main script and specify which test(s) you want to run. (Be careful if you do this, though, because it will overwrite any prior run.)
python main.py --run-tests --test print_hello --model gpt-3.5-turbo-0125
If you've generated many saved runs previously, you can load them into one grid with
python main.py --load-saved --generate-report --model [model1] --model [model2]
And finally, if you've run the tests previously at one git commit, and want to just run any tests that have changed since then, you can run
python main.py --run-tests --only-changed [commit hash] --model gpt-3.5-turbo-0125
This framework is fairly modular. If you'd like to test a new LLM, you'll just need to add a new file to llms/[model_name]_model.py which implements an inference function. From there, modify llm.py to include the appropriate model.
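For example, a new model wrapper might look roughly like the following. This is only a sketch: the exact class and method names the project expects are defined by llm.py and the existing files under llms/, and the endpoint here is a made-up placeholder, so check a real wrapper before copying it.

```python
# llms/mymodel_model.py -- hypothetical example, not the project's real interface.
import json
import requests  # assumes the requests package is available


class MyModel:
    def __init__(self, name="my-model-v1", api_base="https://example.invalid/v1"):
        self.name = name          # hypothetical model name
        self.api_base = api_base  # hypothetical API endpoint

    def make_request(self, prompt, system_prompt=None):
        """Send one prompt to the model and return its text completion."""
        payload = {"model": self.name, "prompt": prompt}
        if system_prompt:
            payload["system"] = system_prompt
        resp = requests.post(f"{self.api_base}/generate",
                             data=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             timeout=120)
        resp.raise_for_status()
        return resp.json()["text"]
```

Whatever the method ends up being named, the important contract is: take a prompt string in, return the model's text completion out.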
As discussed above, I've built a small dataflow DSL to implement test cases. This should let anyone add new test cases fairly easily. Let me repeat the initial test case examples in full:
TestHelloWorld = 'Print "hello world" in python' >> LLMRun() >> \
ExtractCode() >> PythonRun() >> SubstringEvaluator("hello world")
Recall that you should read the >> operator as "and then do". If you're a bash person, it's like a | pipe. If you're a Haskell person, it's like the $ operator.
- We start by running the LLM we want to test with the prompt: Print "hello world" in python.
- The model will probably give some code, but might also give an explanation or start by saying "Sure! I can answer your question." So we take whatever output came out of the model and pass it through a function to just extract the first code block.
- We then actually go and run this python code, whatever it is. To be somewhat safe, we do this by spawning a new docker environment and running the code there (a rough sketch of this step follows the list).
- And finally, we verify that the code was correct by checking that the program's stdout contains the substring "hello world".
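Under the hood, the "run it in a container" step can be as simple as copying the generated code into a throwaway container and capturing stdout. Here is a rough sketch of that idea; it is not the project's actual code, it shells out to the docker CLI directly, and it assumes the llm-benchmark-image built earlier:

```python
import os
import subprocess
import tempfile


def run_python_in_container(code, image="llm-benchmark-image", timeout=60):
    """Write the model's code to a temp file and execute it in a throwaway container."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "main.py")
        with open(path, "w") as f:
            f.write(code)
        # --rm deletes the container afterwards; the bind mount is read-only.
        result = subprocess.run(
            ["docker", "run", "--rm", "-v", f"{tmp}:/work:ro",
             image, "python3", "/work/main.py"],
            capture_output=True, text=True, timeout=timeout)
        return result.stdout


print("hello world" in run_python_in_container('print("hello world")'))  # True
```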
This dataflow pipeline also lets you do more complex verification. Below is the same example as from the top where I evaluate a model's ability to understand how to draw .bmp images.
TestFlagDraw = "Write a C program that draws an american flag to stdout as a bmp." >> \
ExtractCode() >> CRun() >> LLMVisionRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | \
SubstringEvaluator("USA") | \
SubstringEvaluator("America"))
Here, after asking the model to draw the flag and running the resulting C code, I evaluate the model by asking another model what flag has been drawn, and checking if it says something like the US flag. Is this a perfect check? No. But verification is usually easier than generation, and so it's probably a good enough approximation of what I want.
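The | used in that last line is just an "or" over evaluators: the test passes if any of the substring checks pass. A minimal, self-contained sketch of how such a combinator can be built (illustrative only, not the repository's code):

```python
# Toy "or" combinator over substring evaluators.
class SubstringEvaluator:
    def __init__(self, substring):
        self.substring = substring

    def __or__(self, other):                 # a | b builds an OrEvaluator
        return OrEvaluator(self, other)

    def run(self, text):
        return self.substring in text


class OrEvaluator:
    def __init__(self, *children):
        self.children = children

    def __or__(self, other):                 # (a | b) | c keeps flattening
        return OrEvaluator(*self.children, other)

    def run(self, text):
        return any(child.run(text) for child in self.children)


check = (SubstringEvaluator("United States") |
         SubstringEvaluator("USA") |
         SubstringEvaluator("America"))
print(check.run("This image shows the flag of the USA."))  # True
```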
If you'd like to add your own tests to this benchmark feel free to open a PR! I'd be happy to accept basically anything interesting.
There are only a few requirements for adding a test (a sketch of what a new test might look like follows the list).
- Test cases must be mechanistically verifiable. This is very limiting, I know. A whole lot of what I use LLMs for isn't verifiable in this way, especially when I'm giving them large blocks of code and asking for specific changes that are hard to unit test. But in order for these to be useful, your test must be easy to verify.
- Test cases should complete quickly. I don't want to wait several minutes just for one test to run.
- Tests should not be evaluated against LLMs during construction. Don't modify the test because the model gave an answer you didn't like. Most LLMs are stochastic enough that there is some way to elicit most behavior with enough trial and error. I want to see how the model answers a human-written test, as questions are normally asked, before any LM-driven refinement.
- Tests should be designed so that passing demonstrates some interesting model capability. Making "gotcha" tests that are designed to show models fail in some way is not useful in this setup.
- Test cases must not download large amounts of data from the internet. Someone else shouldn't have to pay for each run of this benchmark. If you need to test a library, add it to the Dockerfile.
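Putting those requirements together, a hypothetical new test might look something like the following. This mirrors the hello-world pattern shown earlier; the exact imports, node names, and any run_test-style harness come from the project's existing test files, so check one of those before adding your own.

```python
# tests/print_fibonacci.py -- hypothetical example test, not one from the repository.
# It is mechanistically verifiable (check stdout for "34"), fast, and needs no internet.
TestPrintFibonacci = ('Write a python program that prints the first 10 Fibonacci '
                      'numbers, one per line, starting from 0.' >>
                      LLMRun() >> ExtractCode() >> PythonRun() >>
                      SubstringEvaluator("34"))

if __name__ == "__main__":
    # Run it standalone the same way as the other tests, e.g.
    #   PYTHONPATH='.' python tests/print_fibonacci.py
    print(run_test(TestPrintFibonacci))  # run_test: whatever harness the project's tests use
```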
Are there any tests here that are broken? I tried my best to make them all correct, but I can't guarantee correctness for sure. If you find one, I'd be happy to accept fixes.
But please note: a broken test means one where the answer is objectively wrong. Like a test that says 6 is prime. A test that just expects a specific answer to an ambiguous question is not wrong. For example, one test asks "What do I do to fix AutoModel.from_pretrained to make it auto model with lm head" and expects the model to tell me that I should be using the class "AutoModelForCausalLM"; even though the class "AutoModelWithLMHead" exists, that's not what I was looking for.
Should you use this benchmark in your paper? No, you probably shouldn't. At least, you probably shouldn't if you're trying to show that your new model is better or something. This is not meant to be something for academic papers and only evaluates a very specific set of capabilities. For all the reasons mentioned earlier, I don't think this benchmark will accurately capture what academic people should care about for their models. Good for "useful for me?": yes. Good for "is my model better?": I don't think so. But I've now had at least a few people ask me about this who appear unswayed by the above argument.
So here's my answer: if you want to use this in a paper, then link to this github project AND INCLUDE THE GIT COMMIT HASH YOU USED. I make NO GUARANTEES that I won't just arbitrarily edit test cases without warning. In fact, it's already happened in #1! And #3! And #6. So if you want your paper to be at all scientific, make sure to include the git commit hash.
Copyright (C) 2024, Nicholas Carlini [email protected].
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.