yet-another-applied-llm-benchmark
A benchmark to evaluate language models on questions I've previously asked them to solve.
Yet Another Applied LLM Benchmark is a collection of diverse tests designed to evaluate the capabilities of language models in performing real-world tasks. The benchmark includes tests such as converting code, decompiling bytecode, explaining minified JavaScript, identifying encoding formats, writing parsers, and generating SQL queries. It features a dataflow domain-specific language for easily adding new tests and has nearly 100 tests based on actual scenarios encountered when working with language models. The benchmark aims to assess whether models can effectively handle tasks that users genuinely care about.
README:
This is a benchmark I made, for me, to test how well language models perform on tasks I care about. I know I care about them because each test is directly derived from something I've asked an LLM to do for me in the last year.
For example, there are tests in this benchmark that evaluate the ability of a model to:
- convert a python function to an equivalent-but-faster c function;
- decompile python bytecode to functional source code;
- explain the functionality of minified javascript;
- identify the encoding format (in this case, uuencoded) of some data;
- write a parser from a BNF-like grammar;
- convert some english sentences to SQL queries; or,
- write some bash oneliners
There are two defining features of this benchmark that make it interesting:
- I've implemented a simple dataflow domain specific language to make it easy for me (or anyone else!) to add new tests that realistically evaluate model capabilities.
- Directly as a result of this, I've written nearly 100 tests for different situations I've actually encountered when working with LLMs as assistants.
For example, here's a complete test to evaluate if a model can write a "hello world" program.
'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
I first ask the model to write hello world, then run that program (in a docker container, seamlessly behind the scenes), and then check if stdout contains "hello world". (You should read the >> operator as "and then do". So "a >> b" means "do a, and then do b".)
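The repository implements this DSL in plain Python, with each stage overloading the >> operator. As a rough sketch of how such chaining can work (this is illustrative only, not the project's actual implementation; the toy LLM and the use of exec() here stand in for the real model call and the real in-container execution):

```python
# Toy sketch of a ">>" dataflow pipeline. Class names mirror the README's
# examples, but the bodies are simplified stand-ins, not the benchmark's code.
class Node:
    def __rshift__(self, other):             # node >> node
        return Chain(self, other)
    def __rrshift__(self, other):            # "prompt string" >> node
        return Chain(Constant(other), self)
    def run(self, value):
        raise NotImplementedError

class Constant(Node):
    def __init__(self, value): self.value = value
    def run(self, _): return self.value

class Chain(Node):
    def __init__(self, first, second): self.first, self.second = first, second
    def run(self, value): return self.second.run(self.first.run(value))

class LLMRun(Node):                          # stand-in: always answers with hello world
    def run(self, prompt): return 'print("hello world")'

class PythonRun(Node):                       # stand-in: exec() locally instead of in a container
    def run(self, code):
        import io, contextlib
        out = io.StringIO()
        with contextlib.redirect_stdout(out):
            exec(code, {})
        return out.getvalue()

class SubstringEvaluator(Node):
    def __init__(self, substring): self.substring = substring
    def run(self, text): return self.substring in text

pipeline = 'Write a "hello world" program in python' >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
print(pipeline.run(None))                    # True
```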
More interestingly, here's a test case that first asks a model to produce code that draws a flag. To evaluate it, I run the code, partially judge it with another language model, and then fully evaluate the output by comparing to a reference solution.
"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \
LLMRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | SubstringEvaluator("USA") | SubstringEvaluator("America"))
This DSL makes it easy for me to evaluate significantly more diverse and more sophisticated behavior than any other evaluation benchmark I'm aware of. This is helpful for determining whether or not models are capable of performing tasks I actually care about.
I've evaluated a few models on this benchmark. Here's how they perform:
- Claude 3.5 Sonnet: 48% passed
- GPT 4o: 47% passed
- Claude 3 Opus: 42% passed
- Claude 3 Sonnet: 32% passed
- Gemini 1.5 Pro: 32% passed
- Mistral Large: 28% passed
- GPT 3.5: 26% passed
- Mistral Medium: 23% passed
- Gemini 1.0 Pro: 17% passed
A complete evaluation grid is available here.
What this is not: a serious academic benchmark.
In more words: this is not meant to try to rigorously evaluate the capabilities of models on any particular task. It's not meant to be something you can use to decide which model is more capable, more knowledgeable, more factual, less biased, less harmful, more aligned, more helpful, or anything else.
The questions are not optimally prompt-engineered. It is entirely possible---and indeed likely!---that a better phrasing of some of the questions would allow the model to give a better answer.
But I am lazy.
I do not want to remind the model it is AN EXPERT IN PYTHON and tell it that I'll give it a $100,000 tip for giving the right answer OR I WILL MURDER A KITTEN but please pause....take a deep breath....and think step by step by step before answering. (Or whatever the current incantation is people use to get models to work best.)
I just want to type my question and get the right answer. So this benchmark tests for that, on types of questions I've actually cared about having answered.
As a result of my (often intentional) lack of prompt engineering, when a model fails a question, you won't learn very much. Maybe my question was just poorly worded. Maybe it was ambiguous in some way.
Instead, these tests are designed so that I learn something when the model passes. You don't luck your way into correctly compiling Rust programs without having some skill at the language. But you might luck your way into failing by naming the function something I didn't expect and so your correct code just is never invoked.
Again, it's just a collection of questions I've actually asked language models to solve for me to help with various programming tasks, interspersed with a few questions I've asked language models just for fun.
The questions are, for the most part, unmodified questions as I typed them. This means they may not be the most clearly worded (e.g., "In python what __thing__ do I use for ~, kind of like how __add__ is for +", with the answer I'm expecting being __inv__).
Other questions are "unfair" because they require recent knowledge
(e.g., "what is the hidden dimension of llama-2 70b?").
But I care if a model can answer these correctly for me.
Getting this benchmark up and running is fairly straightforward.
On the python side you'll just need to run
pip install -r requirements.txt
to install the python dependencies.
If you want to run it and evaluate a wide range of models you'll also need
pip install -r requirements-extra.txt
to install the dependencies for those other models.
I want to run things in a container to keep them basically safe. Docker is nicer and has slightly better security controls (and so you can use that below if you want), but on linux you need to be root, or give your user almost-root permissions, to start new docker jobs. This scares me a bit.
So I prefer to use podman. Installation is fairly system dependent, so you'll have to look elsewhere to find out how to install it for your system.
The test cases in this benchmark are evaluated by directly executing code that comes out of a language model. Some tests ask the model to rename files, move files around, or make other state-changing operations to your machine.
While I don't think these models have it out for us and will emit rm -rf / out of malice or spite, it's entirely possible (and even likely!) that they'll produce buggy code that will just accidentally trash your computer.
So, to safeguard against this, all LLM output is evaluated from within a
temporary docker container that gets deleted immediately after the test is complete.
(There's also another reason, though: some of the tests assume a fresh install of Ubuntu with particular dependencies in various places. These tests might behave differently on your local machine than they do from within the docker VM.)
If you like to live dangerously (VERY MUCH NOT RECOMMENDED) then there is a flag in the code, I_HAVE_BLIND_FAITH_IN_LLMS_AND_AM_OKAY_WITH_THEM_BRICKING_MY_MACHINE_OR_MAKING_THEM_HALT_AND_CATCH_FIRE, that you can set to True; this will just eval() everything that comes out of the LLMs on your machine directly.
Once you've installed everything, there are a few setup steps before you can run the benchmark.
You should add API keys for any model you want to evaluate. The keys are stored in the config.json file; you can find a template at config.json.example.
Whatever model you are testing, you will also need an OpenAI API key, since an OpenAI model is the default evaluator. This is because a few of the questions require evaluation by a second language model to judge correctness. These secondary evaluations are as simple as possible, but using a high-quality model here is helpful to ensure consistency in the results.
I have had good success using gpt-4-turbo as the evaluation model, but you can configure any model that you want as the evaluator. In my experiments, I had almost identical results with the (cheaper) gpt-3.5-turbo, but in a few cases having the more capable evaluation model gives more reliable results.
To start you'll need to create the docker container where the tests will run. This will first require that you install docker on your machine. Once you've done that, you can then build the image:
docker build -t llm-benchmark-image . # if you're using docker
podman build -t llm-benchmark-image . # if you're using podman
A few test cases require Selenium and Chrome to test if models can generate valid html/javascript programs. Installing the requirements file should install selenium for you, but you'll also need to make sure you install chrome. If you're on ubuntu then you can just run
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
Once you've set up your environment, you can run the entire benchmark in just one line:
python main.py --model gpt-3.5-turbo --run-tests --generate-report
This command will run every single test that's configured, on one model. It will therefore take some time, and will also cost you a few dollars in language model queries. Afterwards, you can view the full result html file in the directory evaluation_examples.
It will also save a cache of this run, so that the next time you can run a new model and view the two results side-by-side. These are saved by default in the directory results/[current git commit hash]/[model name].
If you want to run individual test cases, you can do that too, in two ways. One is to just directly run the test:
PYTHONPATH='.' python tests/print_hello.py
(You can also explore the run_a_simple_testcase.ipynb notebook to quickly run a sample test case on Colab.)
The other, if you want to save the result of this run so you can load it later, is to run the main script and specify which test(s) you want to run. (Be careful if you do this, though, because it will overwrite any prior run.)
python main.py --run-tests --test print_hello --model gpt-3.5-turbo-0125
If you've generated many saved runs previously, you can load them into one grid with
python main.py --load-saved --generate-report --model [model1] --model [model2]
And finally, if you've run the tests previously at one git commit, and want to just run any tests that have changed since then, you can run
python main.py --run-tests --only-changed [commit hash] --model gpt-3.5-turbo-0125
This framework is fairly modular. If you'd like to test a new LLM, you'll just need to add a new file to llms/[model_name]_model.py which implements an inference function. From there, modify llm.py to include the appropriate model.
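For example, a new model wrapper might look roughly like the following. This is only a sketch: the exact class and method names the project expects are defined by llm.py and the existing files under llms/, and the endpoint here is a made-up placeholder, so check a real wrapper before copying it.

```python
# llms/mymodel_model.py -- hypothetical example, not the project's real interface.
import json
import requests  # assumes the requests package is available


class MyModel:
    def __init__(self, name="my-model-v1", api_base="https://example.invalid/v1"):
        self.name = name          # hypothetical model name
        self.api_base = api_base  # hypothetical API endpoint

    def make_request(self, prompt, system_prompt=None):
        """Send one prompt to the model and return its text completion."""
        payload = {"model": self.name, "prompt": prompt}
        if system_prompt:
            payload["system"] = system_prompt
        resp = requests.post(f"{self.api_base}/generate",
                             data=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             timeout=120)
        resp.raise_for_status()
        return resp.json()["text"]
```

Whatever the method ends up being named, the important contract is: take a prompt string in, return the model's text completion out.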
As discussed above, I've built a small dataflow DSL to implement test cases. This should let anyone add new test cases fairly easily. Let me repeat the initial test case examples in full:
TestHelloWorld = 'Print "hello world" in python' >> LLMRun() >> \
ExtractCode() >> PythonRun() >> SubstringEvaluator("hello world")
Recall that you should read the >> operator as "and then do". If you're a bash person, it's like a | pipe. If you're a Haskell person, it's like the $ operator.
- We start by running the LLM we want to test with the prompt: Print "hello world" in python.
- The model will probably give some code, but might also give an explanation or start by saying "Sure! I can answer your question." So we take whatever output came out of the model and pass it through a function to just extract the first code block.
- We then actually go and run this python code, whatever it is. To be somewhat safe, we do this by spawning a new docker environment and running the code there (a rough sketch of this step follows the list).
- And finally, we verify that the code was correct by checking that the program's stdout contains the substring "hello world".
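Under the hood, the "run it in a container" step can be as simple as copying the generated code into a throwaway container and capturing stdout. Here is a rough sketch of that idea; it is not the project's actual code, it shells out to the docker CLI directly, and it assumes the llm-benchmark-image built earlier:

```python
import os
import subprocess
import tempfile


def run_python_in_container(code, image="llm-benchmark-image", timeout=60):
    """Write the model's code to a temp file and execute it in a throwaway container."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "main.py")
        with open(path, "w") as f:
            f.write(code)
        # --rm deletes the container afterwards; the bind mount is read-only.
        result = subprocess.run(
            ["docker", "run", "--rm", "-v", f"{tmp}:/work:ro",
             image, "python3", "/work/main.py"],
            capture_output=True, text=True, timeout=timeout)
        return result.stdout


print("hello world" in run_python_in_container('print("hello world")'))  # True
```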
This dataflow pipeline also lets you do more complex verification. Below is the same example as from the top where I evaluate a model's ability to understand how to draw .bmp images.
TestFlagDraw = "Write a C program that draws an american flag to stdout as a bmp." >> \
ExtractCode() >> CRun() >> LLMVisionRun("What flag is shown in this image?") >> \
(SubstringEvaluator("United States") | \
SubstringEvaluator("USA") | \
SubstringEvaluator("America"))
Here, after asking the model to draw the flag and running the resulting C code, I evaluate the model by asking another model what flag has been drawn, and checking if it says something like the US flag. Is this a perfect check? No. But verification is usually easier than generation, and so it's probably a good enough approximation of what I want.
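The | used in that last line is just an "or" over evaluators: the test passes if any of the substring checks pass. A minimal, self-contained sketch of how such a combinator can be built (illustrative only, not the repository's code):

```python
# Toy "or" combinator over substring evaluators.
class SubstringEvaluator:
    def __init__(self, substring):
        self.substring = substring

    def __or__(self, other):                 # a | b builds an OrEvaluator
        return OrEvaluator(self, other)

    def run(self, text):
        return self.substring in text


class OrEvaluator:
    def __init__(self, *children):
        self.children = children

    def __or__(self, other):                 # (a | b) | c keeps flattening
        return OrEvaluator(*self.children, other)

    def run(self, text):
        return any(child.run(text) for child in self.children)


check = (SubstringEvaluator("United States") |
         SubstringEvaluator("USA") |
         SubstringEvaluator("America"))
print(check.run("This image shows the flag of the USA."))  # True
```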
If you'd like to add your own tests to this benchmark feel free to open a PR! I'd be happy to accept basically anything interesting.
There are only a few requirements for adding a test (a sketch of what a new test might look like follows the list).
- Test cases must be mechanistically verifiable. This is very limiting, I know. A whole lot of what I use LLMs for isn't verifiable in this way, especially when I'm giving them large blocks of code and asking for specific changes that are hard to unit test. But in order for these to be useful, your test must be easy to verify.
- Test cases should complete quickly. I don't want to wait several minutes just for one test to run.
- Tests should not be evaluated against LLMs during construction. Don't modify the test because the model gave an answer you didn't like. Most LLMs are stochastic enough that there is some way to elicit most behavior with enough trial and error. I want to see how the model answers a human-written test, as questions are normally asked, before any LM-driven refinement.
- Tests should be designed so that passing demonstrates some interesting model capability. Making "gotcha" tests that are designed to show models fail in some way is not useful in this setup.
- Test cases must not download large amounts of data from the internet. Someone else shouldn't have to pay for each run of this benchmark. If you need to test a library, add it to the Dockerfile.
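Putting those requirements together, a hypothetical new test might look something like the following. This mirrors the hello-world pattern shown earlier; the exact imports, node names, and any run_test-style harness come from the project's existing test files, so check one of those before adding your own.

```python
# tests/print_fibonacci.py -- hypothetical example test, not one from the repository.
# It is mechanistically verifiable (check stdout for "34"), fast, and needs no internet.
TestPrintFibonacci = ('Write a python program that prints the first 10 Fibonacci '
                      'numbers, one per line, starting from 0.' >>
                      LLMRun() >> ExtractCode() >> PythonRun() >>
                      SubstringEvaluator("34"))

if __name__ == "__main__":
    # Run it standalone the same way as the other tests, e.g.
    #   PYTHONPATH='.' python tests/print_fibonacci.py
    print(run_test(TestPrintFibonacci))  # run_test: whatever harness the project's tests use
```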
Are there any tests here that are broken? I tried my best to make them all correct, but I can't guarantee correctness for sure. If you find one, I'd be happy to accept fixes.
But please note: a broken test means one where the answer is objectively wrong. Like a test that says 6 is prime. A test that just expects a specific answer to an ambiguous question is not wrong. For example, one test asks "What do I do to fix AutoModel.from_pretrained to make it auto model with lm head" and expects the model to tell me that I should be using the class "AutoModelForCausalLM"; even though the class "AutoModelWithLMHead" exists, that's not what I was looking for.
Should you use this benchmark in your paper? No, you probably shouldn't. At least, you probably shouldn't if you're trying to show that your new model is better or something. This is not meant to be something for academic papers and only evaluates a very specific set of capabilities. For all the reasons mentioned earlier, I don't think this benchmark will accurately capture what academic people should care about for their models. Good for "useful for me?": yes. Good for "is my model better?": I don't think so. But I've now had at least a few people ask me about this who appear unswayed by the above argument.
So here's my answer: if you want to use this in a paper, then link to this github project AND INCLUDE THE GIT COMMIT HASH YOU USED. I make NO GUARANTEES that I won't just arbitrarily edit test cases without warning. In fact, it's already happened in #1! And #3! And #6. So if you want your paper to be at all scientific, make sure to include the git commit hash.
Copyright (C) 2024, Nicholas Carlini [email protected].
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.