evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023
Stars: 1085
EvalPlus is a rigorous evaluation framework for LLM4Code, providing HumanEval+ and MBPP+ tests to evaluate large language models on code generation tasks. It offers precise evaluation and ranking, coding rigorousness analysis, and pre-generated code samples. With EvalPlus, users can generate code solutions, post-process code, and evaluate code quality. It also ships tools for code generation and test input generation using various backends.
README:
🔥Quick Start • 💻LLM code • 🔨Tools • 📜Citation • 🙏Acknowledgement
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation & ranking: see our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: look at the score differences before and after using the EvalPlus tests! A smaller drop is better, as it means more rigorousness and less laxity in code generation, while a big drop means the generated code tends to be fragile.
- ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!
Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!
[!Important]
🚧 MBPP+ update (v0.1.0 to v0.2.0): We recently improved and stabilized the MBPP+ dataset by removing some tasks whose `test_list` is wrong (inherited from the original MBPP dataset) to make them more reasonable to solve. In v0.1.0 MBPP+ has 399 tasks, while the new v0.2.0 has 378 tasks. We also improved the oracle. Therefore, using v0.2.0 you might expect a ~4pp pass@1 improvement for both base and plus tests.
[!Tip]
EvalPlus ❤️ bigcode-evaluation-harness! HumanEval+ and MBPP+ have been integrated into bigcode-evaluation-harness, so you can also run the EvalPlus datasets there!
To get started, please first setup the environment:
pip install evalplus --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (including the code) and save the samples to `samples.jsonl`:
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl
samples = [
dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
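For concreteness, here is a minimal sketch of what `GEN_SOLUTION` could look like. It is an illustrative assumption, not part of EvalPlus: the model name, prompt wording, and use of the OpenAI SDK are all placeholder choices.

```python
# Hypothetical GEN_SOLUTION using the OpenAI Python SDK; model name and prompt
# wording are illustrative assumptions, not part of EvalPlus.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def GEN_SOLUTION(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {
                "role": "system",
                "content": "Complete the following function. "
                "Reply with the full, self-contained function definition only.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
    )
    # Return the raw model output; post-process it before evaluation.
    return response.choices[0].message.content
```

Because raw model output may still contain markdown fences or prose, run `evalplus.sanitize` (described below) on the resulting samples before evaluation.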
🤔 Structure of `problem`? :: click to expand ::
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs from the original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
[!Note]
Expected schema of `samples.jsonl`:
- `task_id`: Task ID, which are the keys of `get_[human_eval|mbpp]_plus()`
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}`
- `completion` (optional): Function body without the prompt
  - Example: `{"task_id": "HumanEval/?", "completion": " return 1"}`

Only one of `solution` and `completion` is required. If both are provided, `solution` will be used. We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}` where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).
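If you prefer the directory form over a single `samples.jsonl`, the sketch below (an illustrative helper, not part of EvalPlus) writes samples following the layout just described; the per-task counter used as `SAMPLE_ID` is an assumption.

```python
# Illustrative helper (not part of EvalPlus): write samples using the
# ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py layout described above,
# where TASK_ID = task_id.replace("/", "_").
import os

def write_sample_dir(sample_dir: str, samples: list) -> None:
    counters = {}  # per-task counter used as SAMPLE_ID (assumed convention)
    for sample in samples:
        task_id = sample["task_id"]
        task_dir = os.path.join(sample_dir, task_id.replace("/", "_"))
        os.makedirs(task_dir, exist_ok=True)
        sample_id = counters.get(task_id, 0)
        counters[task_id] = sample_id + 1
        with open(os.path.join(task_dir, f"{sample_id}.py"), "w") as f:
            f.write(sample["solution"])
```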
LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete extra code. We provide a tool named `evalplus.sanitize` to clean up the code:
# 💡 If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl
# Sanitized code will be produced to `samples-sanitized.jsonl`
# 💡 If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
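To give a rough idea of what this post-processing step deals with (an illustrative approximation only, not the actual `evalplus.sanitize` implementation): chat models often wrap code in markdown fences and surround it with prose, so a cleanup pass typically extracts the code portion.

```python
# Rough, illustrative approximation of such cleanup; the real
# evalplus.sanitize is more thorough than this.
import re

def strip_markdown_fences(text: str) -> str:
    """Return the first fenced code block if one exists, else the text unchanged."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1) if match else text
```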
🔎 Checking the compilability of post-processed code :: click to expand ::
To double-check the post-processing results, you can use `evalplus.syncheck` to check code validity before and after sanitization; it will print erroneous code snippets and explain why they are wrong:
# 💡 If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]
# 💡 If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
We strongly recommend running the evaluation in a sandbox such as Docker:
docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl
...Or, if you want to try it locally regardless of the risks:
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
To use a user-defined dataset locally, you can set `HUMANEVAL_OVERRIDE_PATH` or `MBPP_OVERRIDE_PATH`:
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
[!Tip]
Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set the timeout $T = \max(T_{base}, T_{gt} \times k)$, where:
- $T_{base}$ is the minimal timeout (configurable by `--min-time-limit`; defaults to 1s);
- $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
- $k$ is a configurable factor `--gt-time-limit-factor` (defaults to 4).

If your machine is too slow and you are getting high-variance results, try using a larger $k$ and $T_{base}$.

Additionally, you are NOT encouraged to over-stress your test bed while running the evaluation. For example, using `--parallel 64` on a 4-core machine, or doing other heavy work during evaluation, are bad ideas...
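As a concrete worked example (not from the README itself): with the defaults $T_{base} = 1\,\text{s}$ and $k = 4$, a ground-truth runtime of $T_{gt} = 0.5\,\text{s}$ yields $T = \max(1, 0.5 \times 4) = 2\,\text{s}$, while a very fast ground truth of $T_{gt} = 0.05\,\text{s}$ still gets the $1\,\text{s}$ floor because $0.05 \times 4 = 0.2 < 1$.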
🤔 Evaluate with local GitHub repo? :: click to expand ::
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
- `--parallel`: by default half of the cores
- `--base-only` (store_true): only run the base HumanEval tests
- `--i-just-wanna-run`: force a re-run
The output should look like the following (GPT-4 greedy decoding example):
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`, where k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be produced. Remove it to re-run the evaluation
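For reference (standard background, not specific to EvalPlus): `pass@k` is conventionally computed with the unbiased estimator introduced alongside HumanEval, where for each task $n \ge k$ samples are generated and $c$ of them pass all tests:

$$\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

With greedy decoding ($n = k = 1$), this reduces to the fraction of tasks whose single sample passes, as in the output above.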
🤔 How long would it take? :: click to expand ::
If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds.
When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes using `--parallel 64` and `--test-details`.
Here are some tips to speed up the evaluation:
- Use `--parallel $(nproc)`
- Do NOT use `--test-details` if you just want to quickly get pass@k, as `--test-details` runs all tests (700+ on average for each task), whereas without `--test-details` the testing of a sample stops immediately when it fails its first test.
- Use our pre-evaluated results (see LLM-generated code)
- Use HumanEval+ Mini
[!Tip]
🚀 Try out `HumanEvalPlus-Mini`! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a `--mini` flag, and it can run 23+% faster! (even faster if you evaluate all tests without fail-stop using `--test-details`).

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
We also share pre-generated code samples from LLMs we have evaluated:
- HumanEval+: See the attachment of our v0.1.0 release.
- MBPP+: See the attachment of our v0.2.0 release.
Each sample file is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`. You can unzip them to a folder named like `${model_name}_temp_${temperature}` and run the evaluation from scratch with:
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
To use these tools, please first install the repository from GitHub:
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
You can use `codegen/generate.py` to perform code generation.
We currently support the following backends:
- `vllm`: set `--model` to a Hugging Face model ID such as `microsoft/Phi-3-mini-128k-instruct`
- `hf`: HuggingFace Transformers; set `--model` the same way
- `openai`: configure `OPENAI_API_KEY`; you can also configure `--base-url`
- `anthropic`: configure `ANTHROPIC_API_KEY`
- `mistral`: configure `MISTRAL_API_KEY`
python codegen/generate.py --model "mistralai/Mistral-7B-Instruct-v0.2" --greedy --root [result_path] --dataset [mbpp|humaneval] --backend [vllm]
To perform code generation using user-defined tasks and datasets:
# Override HumanEval datasets
HUMANEVAL_OVERRIDE_PATH="/path/to/HumanEvalPlus.jsonl.gz" python codegen/generate.py --model "mistralai/Mistral-7B-Instruct-v0.2" --greedy --root [result_path] --dataset humaneval --backend [vllm|hf|openai]
# Override MBPP datasets
MBPP_OVERRIDE_PATH="/path/to/MbppPlus.jsonl.gz" python codegen/generate.py --model "mistralai/Mistral-7B-Instruct-v0.2" --greedy --root [result_path] --dataset mbpp --backend [vllm|hf|openai]
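The override file presumably follows the same gzipped-JSONL problem format described in the `problem` structure above. Below is a minimal, assumption-level sketch of producing such a file; the field contents are placeholders and the exact input encodings are unverified guesses, so consult `evalplus.data` for the authoritative schema.

```python
# Assumption-level sketch: write a custom override dataset as gzipped JSONL,
# reusing the problem fields documented earlier. Input encodings are guesses.
import gzip
import json

problems = [
    {
        "task_id": "Custom/0",
        "entry_point": "add",
        "prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n',
        "canonical_solution": "    return a + b\n",
        "base_input": [[1, 2], [0, 0]],        # placeholder base test inputs
        "plus_input": [[-1, 1], [123, -123]],  # placeholder extra test inputs
    },
]

with gzip.open("CustomPlus.jsonl.gz", "wt") as f:
    for problem in problems:
        f.write(json.dumps(problem) + "\n")
```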
Please check `evalplus/inputgen.py`.
@inproceedings{evalplus,
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
year = {2023},
url = {https://openreview.net/forum?id=1qvx610Cu7},
}