llm-structured-output-benchmarks
Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc. on tasks like multi-label classification, named entity recognition, and synthetic data generation.
- Multi-label classification

  | Framework | Model | Reliability | Latency p95 (s) |
  | --- | --- | --- | --- |
  | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.804* |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

- Named Entity Recognition

  | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
  | --- | --- | --- | --- | --- | --- | --- |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
  | Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

- Synthetic Data Generation

  | Framework | Model | Reliability | Latency p95 (s) | Variety |
  | --- | --- | --- | --- | --- |
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
  | Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.690 | 2.354* | 0.942 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |

\* Latencies marked with an asterisk were measured on an NVIDIA GeForce RTX 4080 Super GPU.
- Install the requirements using `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark using `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results using:
  - Multilabel classification: `python -m main generate-results`
  - NER: `python -m main generate-results --task ner`
  - Synthetic data generation: `python -m main generate-results --task synthetic_data_generation`
- To get help on the command line arguments, add `--help` after the command, e.g. `python -m main run-benchmark --help`
- Multi-label classification:
  - Task: Given a text, predict the labels associated with it.
  - Data:
    - Base data: Alexa intent detection dataset
    - The benchmarking test is run on synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
    - The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, following a chosen distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
  - Prompt: `"Classify the following text: {text}"`
  - Evaluation Metrics (an illustrative sketch of the metric computation follows this section):
    - Reliability: The percentage of times the framework returns valid labels without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs for each row.
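The repository computes these metrics internally when generating results; purely as an illustration (not the repository's actual code), here is a minimal Python sketch, assuming per-row run logs with a `percent_successful` value and a list of per-run latencies:

```python
import numpy as np
import pandas as pd

# Hypothetical per-row results from a benchmark run: for each input row, the
# fraction of n_runs that returned valid output, plus each run's wall-clock time.
rows = pd.DataFrame({
    "percent_successful": [1.0, 1.0, 0.9, 1.0],
    "latencies": [[0.8, 0.9], [1.1, 1.0], [1.4, 1.3], [0.7, 0.9]],  # seconds per run
})

# Reliability: average of the per-row percent_successful values.
reliability = rows["percent_successful"].mean()

# Latency p95: 95th percentile over all individual run times.
all_latencies = np.concatenate(rows["latencies"].to_list())
latency_p95 = np.percentile(all_latencies, 95)

print(f"Reliability: {reliability:.3f}, Latency p95: {latency_p95:.3f}s")
```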
- Named Entity Recognition
  - Task: Given a text, extract the entities present in it.
  - Data:
    - Base data: Synthetic PII Finance dataset
    - The benchmarking test is run on a sample of the data generated by running `python -m data_sources.generate_dataset generate-ner-data`.
    - The data is sampled from the base data so that the number of entities per row follows a chosen distribution. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
  - Prompt: `Extract and resolve a list of entities from the following text: {text}`
  - Evaluation Metrics (an illustrative sketch of the micro-averaged metrics follows this section):
    - Reliability: The percentage of times the framework returns valid labels without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Precision: The micro average of the precision of the framework on the data.
    - Recall: The micro average of the recall of the framework on the data.
    - F1 Score: The micro average of the F1 score of the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs for each row.
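The micro-averaged entity metrics pool counts over all rows before computing precision, recall, and F1. The sketch below is for illustration only and is not the repository's implementation; it assumes predictions and ground truths are represented as sets of `(entity_type, value)` pairs:

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str]  # assumed (entity_type, value) representation

def micro_prf1(predictions: List[Set[Entity]], ground_truths: List[Set[Entity]]):
    """Micro-average: pool true/false positives and false negatives over all rows."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        tp += len(pred & gold)  # entities both predicted and present
        fp += len(pred - gold)  # predicted but not in the ground truth
        fn += len(gold - pred)  # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with made-up entities
preds = [{("PERSON", "Alice"), ("ORG", "Acme")}, {("DATE", "2024-01-01")}]
golds = [{("PERSON", "Alice")}, {("DATE", "2024-01-01"), ("ORG", "Acme")}]
print(micro_prf1(preds, golds))  # (0.666..., 0.666..., 0.666...)
```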
- Synthetic Data Generation
  - Task: Generate synthetic data according to a Pydantic data model schema.
  - Data: A two-level nested User details Pydantic schema.
  - Prompt: `Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.`
  - Evaluation Metrics (an illustrative sketch of the schema and the Variety metric follows this section):
    - Reliability: The percentage of times the framework returns valid output without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Variety: The percentage of generated names that are unique among all names generated.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs.
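The actual User schema lives in the repository; the following is only an assumed illustration of what a two-level nested Pydantic model and the Variety computation could look like (all field names and sample values are made up):

```python
from pydantic import BaseModel

# Hypothetical two-level nested "User details" schema; the repository's actual
# field names may differ.
class Address(BaseModel):
    street: str
    city: str
    country: str

class User(BaseModel):
    name: str
    age: int
    address: Address  # nested model -> second level of the schema

# Variety: fraction of generated names that are unique among all generated names.
generated = [
    User(name="Zephyrine Quill", age=31, address=Address(street="1 Elm", city="Oslo", country="NO")),
    User(name="Zephyrine Quill", age=29, address=Address(street="2 Oak", city="Lima", country="PE")),
    User(name="Basil Thorn", age=44, address=Address(street="3 Fir", city="Kyoto", country="JP")),
]
names = [user.name for user in generated]
variety = len(set(names)) / len(names)
print(f"Variety: {variety:.3f}")  # 0.667
```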
- Create a new pandas dataframe pickle file with the following columns (see the sketch after this list):
  - `text`: The text to be sent to the framework
  - `labels`: List of labels associated with the text
  - See `data/multilabel_classification.pkl` for an example.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`
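As a minimal sketch of the expected file (the column names come from the steps above; the file path, texts, and labels are made-up examples):

```python
import pandas as pd

# Dataframe with the two columns the benchmark expects.
df = pd.DataFrame({
    "text": [
        "set an alarm for six in the morning",
        "play some jazz and dim the lights",
    ],
    "labels": [  # list of labels per row
        ["alarm_set"],
        ["play_music", "iot_hue_lightdim"],
    ],
})

# Save it to the path that config.yaml's source_data_pickle_path will point to.
df.to_pickle("data/my_new_multilabel_data.pkl")
```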
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows:
- Create a .py file in the `frameworks` directory with the name of the framework, e.g. `instructor_framework.py` for the Instructor framework.
- In this .py file create a class that inherits `BaseFramework` from `frameworks.base`.
- The class should define an `init` method that initializes the base class. Here are the arguments the base class expects:
  - `task` (str): the task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`
  - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `retries` (int): Number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
- The class should define a `run` method that takes the following arguments:
  - `task`: The task that the framework is being tested on. Obtained from the `task` in the `./config.yaml` file, e.g. `"multilabel_classification"`
  - `n_runs`: number of times to repeat each text
  - `expected_response`: Output expected from the framework. Use default value of `None`
  - `inputs`: a dictionary of `{"text": str}` where `str` is the text to be sent to the framework. Use default value of empty dictionary `{}`
- This `run` method should create another `run_experiment` function that takes `inputs` as argument, runs that input through the framework, and returns the output.
- The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base` with `n_runs`, `expected_response` and `task` as arguments.
- The `run` method should call the `run_experiment` function and return the four outputs `predictions`, `percent_successful`, `metrics` and `latencies`.
- Import this new class in `frameworks/__init__.py`.
- Add a new entry in the `./config.yaml` file with the name of the class as the key. The yaml entry can have the following fields:
  - `task`: the task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`
  - `n_runs`: number of times to repeat each text
  - `init_kwargs`: all the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
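The following is a rough, assumed sketch of what such a class could look like. It is not copied from the repository: the exact signatures of `BaseFramework` and the `@experiment` decorator, and helpers such as `self.prompt` and `call_framework()`, should be checked against `./frameworks/instructor_framework.py` and `frameworks/base.py`.

```python
from typing import Any

from frameworks.base import BaseFramework, experiment  # import path taken from the steps above


class MyFramework(BaseFramework):
    """Hypothetical framework wrapper following the steps above."""

    def __init__(self, **kwargs) -> None:
        # Forward task, prompt, llm_model, llm_model_family, retries,
        # source_data_pickle_path, sample_rows, response_model, ... to the base class.
        super().__init__(**kwargs)
        # Initialize whatever client object the wrapped framework needs here (omitted).

    def run(self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}):
        # The @experiment decorator from frameworks.base repeats the call n_runs times
        # and (per the steps above) tracks success rate, metrics, and latencies.
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Fill the prompt template with the input text and call the wrapped framework.
            # self.prompt and call_framework() are assumptions, not the repo's actual names.
            prompt = self.prompt.format(**inputs)
            return self.call_framework(prompt)  # hypothetical helper

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```

The matching `./config.yaml` entry would then use the class name (`MyFramework`) as its key, with `task`, `n_runs`, and `init_kwargs` fields as described in the last step above.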
- Framework related tasks:

  | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
  | --- | --- | --- | --- |
  | OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
  | Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
  | Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
  | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
  | Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
  | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
  | Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
  | DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
  | Langchain | 💭 Planning | 💭 Planning | 💭 Planning |

- Others
- [x] Latency metrics
- [ ] CICD pipeline for benchmark run automation
- [ ] Async run
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:
@software{marie_stephen_leo_2024_12327267,
author = {Marie Stephen Leo},
title = {{stephenleo/llm-structured-output-benchmarks:
Release for Zenodo}},
month = jun,
year = 2024,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.12327267},
url = {https://doi.org/10.5281/zenodo.12327267}
}
If this work helped you in any way, please consider starring ⭐ this repository to give me feedback so I can spend more time on this project.