llm-structured-output-benchmarks
Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc. on tasks like multi-label classification, named entity recognition, and synthetic data generation.
- Multi-label classification

  | Framework | Model | Reliability | Latency p95 (s) |
  | --- | --- | --- | --- |
  | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.804* |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

- Named Entity Recognition

  | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
  | --- | --- | --- | --- | --- | --- | --- |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
  | Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
  | Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

- Synthetic Data Generation

  | Framework | Model | Reliability | Latency p95 (s) | Variety |
  | --- | --- | --- | --- | --- |
  | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
  | Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
  | Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
  | Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
  | Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
  | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.690 | 2.354* | 0.942 |
  | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
  | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |

\* Latencies marked with an asterisk were measured on an NVIDIA GeForce RTX 4080 Super GPU.
- Install the requirements using `pip install -r requirements.txt`
- Set the OpenAI API key: `export OPENAI_API_KEY=sk-...`
- Run the benchmark using `python -m main run-benchmark`
- Raw results are stored in the `results` directory.
- Generate the results using:
  - Multilabel classification: `python -m main generate-results`
  - NER: `python -m main generate-results --task ner`
  - Synthetic data generation: `python -m main generate-results --task synthetic_data_generation`
- To get help on the command line arguments, add `--help` after the command, e.g. `python -m main run-benchmark --help`
- Multi-label classification:
  - Task: Given a text, predict the labels associated with it.
  - Data:
    - Base data: Alexa intent detection dataset
    - The benchmarking test is run on synthetic data generated by running `python -m data_sources.generate_dataset generate-multilabel-data`.
    - The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, following a chosen distribution over the number of classes per row. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
  - Prompt: `"Classify the following text: {text}"`
  - Evaluation Metrics (an illustrative sketch of the metric computation follows this section):
    - Reliability: The percentage of times the framework returns valid labels without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs for each row.
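The repository computes these metrics internally when generating results; purely as an illustration (not the repository's actual code), here is a minimal Python sketch, assuming per-row run logs with a `percent_successful` value and a list of per-run latencies:

```python
import numpy as np
import pandas as pd

# Hypothetical per-row results from a benchmark run: for each input row, the
# fraction of n_runs that returned valid output, plus each run's wall-clock time.
rows = pd.DataFrame({
    "percent_successful": [1.0, 1.0, 0.9, 1.0],
    "latencies": [[0.8, 0.9], [1.1, 1.0], [1.4, 1.3], [0.7, 0.9]],  # seconds per run
})

# Reliability: average of the per-row percent_successful values.
reliability = rows["percent_successful"].mean()

# Latency p95: 95th percentile over all individual run times.
all_latencies = np.concatenate(rows["latencies"].to_list())
latency_p95 = np.percentile(all_latencies, 95)

print(f"Reliability: {reliability:.3f}, Latency p95: {latency_p95:.3f}s")
```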
- Named Entity Recognition
  - Task: Given a text, extract the entities present in it.
  - Data:
    - Base data: Synthetic PII Finance dataset
    - The benchmarking test is run on a sample of the data generated by running `python -m data_sources.generate_dataset generate-ner-data`.
    - The data is sampled from the base data so that the number of entities per row follows a chosen distribution. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
  - Prompt: `Extract and resolve a list of entities from the following text: {text}`
  - Evaluation Metrics (an illustrative sketch of the micro-averaged metrics follows this section):
    - Reliability: The percentage of times the framework returns valid labels without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Precision: The micro average of the precision of the framework on the data.
    - Recall: The micro average of the recall of the framework on the data.
    - F1 Score: The micro average of the F1 score of the framework on the data.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs for each row.
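The micro-averaged entity metrics pool counts over all rows before computing precision, recall, and F1. The sketch below is for illustration only and is not the repository's implementation; it assumes predictions and ground truths are represented as sets of `(entity_type, value)` pairs:

```python
from typing import List, Set, Tuple

Entity = Tuple[str, str]  # assumed (entity_type, value) representation

def micro_prf1(predictions: List[Set[Entity]], ground_truths: List[Set[Entity]]):
    """Micro-average: pool true/false positives and false negatives over all rows."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        tp += len(pred & gold)  # entities both predicted and present
        fp += len(pred - gold)  # predicted but not in the ground truth
        fn += len(gold - pred)  # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with made-up entities
preds = [{("PERSON", "Alice"), ("ORG", "Acme")}, {("DATE", "2024-01-01")}]
golds = [{("PERSON", "Alice")}, {("DATE", "2024-01-01"), ("ORG", "Acme")}]
print(micro_prf1(preds, golds))  # (0.666..., 0.666..., 0.666...)
```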
- Synthetic Data Generation
  - Task: Generate synthetic data according to a Pydantic data model schema.
  - Data: A two-level nested User details Pydantic schema.
  - Prompt: `Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.`
  - Evaluation Metrics (an illustrative sketch of the schema and the Variety metric follows this section):
    - Reliability: The percentage of times the framework returns valid output without errors, i.e. the average of all rows' `percent_successful` values.
    - Latency: The 95th percentile of the time taken to run the framework on the data.
    - Variety: The percentage of generated names that are unique among all names generated.
  - Experiment Details: Run each row through the framework `n_runs` times and log the percent of successful runs.
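The actual User schema lives in the repository; the following is only an assumed illustration of what a two-level nested Pydantic model and the Variety computation could look like (all field names and sample values are made up):

```python
from pydantic import BaseModel

# Hypothetical two-level nested "User details" schema; the repository's actual
# field names may differ.
class Address(BaseModel):
    street: str
    city: str
    country: str

class User(BaseModel):
    name: str
    age: int
    address: Address  # nested model -> second level of the schema

# Variety: fraction of generated names that are unique among all generated names.
generated = [
    User(name="Zephyrine Quill", age=31, address=Address(street="1 Elm", city="Oslo", country="NO")),
    User(name="Zephyrine Quill", age=29, address=Address(street="2 Oak", city="Lima", country="PE")),
    User(name="Basil Thorn", age=44, address=Address(street="3 Fir", city="Kyoto", country="JP")),
]
names = [user.name for user in generated]
variety = len(set(names)) / len(names)
print(f"Variety: {variety:.3f}")  # 0.667
```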
- Create a new pandas dataframe pickle file with the following columns (see the sketch after this list):
  - `text`: The text to be sent to the framework
  - `labels`: List of labels associated with the text
  - See `data/multilabel_classification.pkl` for an example.
- Add the path to the new pickle file in the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test.
- Run the benchmark using `python -m main run-benchmark` to test the new data on all the frameworks!
- Generate the results using `python -m main generate-results`
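As a minimal sketch of the expected file (the column names come from the steps above; the file path, texts, and labels are made-up examples):

```python
import pandas as pd

# Dataframe with the two columns the benchmark expects.
df = pd.DataFrame({
    "text": [
        "set an alarm for six in the morning",
        "play some jazz and dim the lights",
    ],
    "labels": [  # list of labels per row
        ["alarm_set"],
        ["play_music", "iot_hue_lightdim"],
    ],
})

# Save it to the path that config.yaml's source_data_pickle_path will point to.
df.to_pickle("data/my_new_multilabel_data.pkl")
```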
The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows:
- Create a .py file in the `frameworks` directory with the name of the framework, e.g. `instructor_framework.py` for the Instructor framework.
- In this .py file create a class that inherits `BaseFramework` from `frameworks.base`.
- The class should define an `init` method that initializes the base class. Here are the arguments the base class expects:
  - `task` (str): the task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`
  - `prompt` (str): Prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `llm_model` (str): LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `llm_model_family` (str): LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `retries` (int): Number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `source_data_pickle_path` (str): Path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `sample_rows` (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
  - `response_model` (Any): The response model to be used. Internally passed by the benchmarking script.
- The class should define a `run` method that takes the following arguments:
  - `task`: The task that the framework is being tested on. Obtained from the `task` in the `./config.yaml` file, e.g. `"multilabel_classification"`
  - `n_runs`: number of times to repeat each text
  - `expected_response`: Output expected from the framework. Use default value of `None`
  - `inputs`: a dictionary of `{"text": str}` where `str` is the text to be sent to the framework. Use default value of empty dictionary `{}`
- This `run` method should create another `run_experiment` function that takes `inputs` as argument, runs that input through the framework, and returns the output.
- The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base` with `n_runs`, `expected_response` and `task` as arguments.
- The `run` method should call the `run_experiment` function and return the four outputs `predictions`, `percent_successful`, `metrics` and `latencies`.
- Import this new class in `frameworks/__init__.py`.
- Add a new entry in the `./config.yaml` file with the name of the class as the key. The yaml entry can have the following fields:
  - `task`: the task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`
  - `n_runs`: number of times to repeat each text
  - `init_kwargs`: all the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
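The following is a rough, assumed sketch of what such a class could look like. It is not copied from the repository: the exact signatures of `BaseFramework` and the `@experiment` decorator, and helpers such as `self.prompt` and `call_framework()`, should be checked against `./frameworks/instructor_framework.py` and `frameworks/base.py`.

```python
from typing import Any

from frameworks.base import BaseFramework, experiment  # import path taken from the steps above


class MyFramework(BaseFramework):
    """Hypothetical framework wrapper following the steps above."""

    def __init__(self, **kwargs) -> None:
        # Forward task, prompt, llm_model, llm_model_family, retries,
        # source_data_pickle_path, sample_rows, response_model, ... to the base class.
        super().__init__(**kwargs)
        # Initialize whatever client object the wrapped framework needs here (omitted).

    def run(self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}):
        # The @experiment decorator from frameworks.base repeats the call n_runs times
        # and (per the steps above) tracks success rate, metrics, and latencies.
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Fill the prompt template with the input text and call the wrapped framework.
            # self.prompt and call_framework() are assumptions, not the repo's actual names.
            prompt = self.prompt.format(**inputs)
            return self.call_framework(prompt)  # hypothetical helper

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```

The matching `./config.yaml` entry would then use the class name (`MyFramework`) as its key, with `task`, `n_runs`, and `init_kwargs` fields as described in the last step above.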
- Framework related tasks:

  | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
  | --- | --- | --- | --- |
  | OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
  | Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
  | Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
  | Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
  | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
  | Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
  | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
  | Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
  | DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
  | Langchain | 💭 Planning | 💭 Planning | 💭 Planning |

- Others
- [x] Latency metrics
- [ ] CICD pipeline for benchmark run automation
- [ ] Async run
Contributions are welcome! Here are the steps to contribute:
- Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
- Once the issue is assigned to you, please submit a PR with the new framework!
To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:
@software{marie_stephen_leo_2024_12327267,
author = {Marie Stephen Leo},
title = {{stephenleo/llm-structured-output-benchmarks:
Release for Zenodo}},
month = jun,
year = 2024,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.12327267},
url = {https://doi.org/10.5281/zenodo.12327267}
}
If this work helped you in any way, please consider starring ⭐ this repository to give me feedback so I can spend more time on this project.