Evaluator
Open-source library for scalable, reproducible evaluation of AI models and benchmarks.
Stars: 194
NeMo Evaluator SDK is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables running hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. The platform ensures auditable and trustworthy results by executing evaluations in open-source Docker containers. NeMo Evaluator SDK is built on four core principles: Reproducibility by Default, Scale Anywhere, State-of-the-Art Benchmarking, and Extensible and Customizable.
README:
- TAU2-Bench (tau2-bench): Conversational agents in dual-control environments (telecom, airline, retail)
- RULER (long-context-eval): Long-context evaluation with configurable sequence lengths (4K to 1M tokens)
- CoDec (contamination-detection): Contamination detection, a practical and accurate method to detect and quantify training data contamination in large language models
- MTEB (mteb): Massive Text Embedding Benchmark
NeMo Evaluator SDK is an open-source platform for robust, reproducible, and scalable evaluation of Large Language Models. It enables you to run hundreds of benchmarks across popular evaluation harnesses against any OpenAI-compatible model API. Evaluations execute in open-source Docker containers for auditable and trustworthy results. The platform's containerized architecture allows for the rapid integration of public benchmarks and private datasets.
NeMo Evaluator SDK is built on four core principles to provide a reliable and versatile evaluation experience:
- Reproducibility by Default: All configurations, random seeds, and software provenance are captured automatically for auditable and repeatable evaluations.
- Scale Anywhere: Run evaluations from a local machine to a Slurm cluster or cloud-native backends like Lepton AI without changing your workflow.
- State-of-the-Art Benchmarking: Access a comprehensive suite of over 100 benchmarks from 18 popular open-source evaluation harnesses. See the full list of Supported benchmarks and evaluation harnesses.
- Extensible and Customizable: Integrate new evaluation harnesses, add custom benchmarks with proprietary data, and define custom result exporters for existing MLOps tooling.
The platform consists of two main components:
- nemo-evaluator (The Evaluation Core Engine): A Python library that manages the interaction between an evaluation harness and the model being tested.
- nemo-evaluator-launcher (The CLI and Orchestration): The primary user interface and orchestration layer. It handles configuration, selects the execution environment, and launches the appropriate container to run the evaluation.
Most users typically interact with nemo-evaluator-launcher, which serves as a universal gateway to different benchmarks and harnesses. However, it is also possible to interact directly with nemo-evaluator by following this guide.
NeMo Evaluator Launcher provides pre-built evaluation containers for the different evaluation harnesses through the NVIDIA NGC catalog. Each harness supports a variety of benchmarks, which can then be invoked via nemo-evaluator. The table below lists the benchmark names per harness (you can also enumerate tasks from the launcher CLI; see the sketch after the table). A more detailed list of task names can be found in the list of NGC containers.
| Container | Description | NGC Catalog | Latest Tag | Supported benchmarks |
|---|---|---|---|---|
| bfcl | Function calling | Link | 26.01 | BFCL v2 and v3 |
| bigcode-evaluation-harness | Code generation evaluation | Link | 26.01 | MBPP, MBPP-Plus, HumanEval, HumanEval+, Multiple (cpp, cs, d, go, java, jl, js, lua, php, pl, py, r, rb, rkt, rs, scala, sh, swift, ts) |
| compute-eval | CUDA code evaluation | Link | 26.01 | CCCL, Combined Problems, CUDA |
| CoDec | Contamination detection | Link | 26.01 | CoDec |
| garak | Safety and vulnerability testing | Link | 26.01 | Garak |
| genai-perf | GenAI performance benchmarking | Link | 26.01 | GenAI Perf Generation & Summarization |
| helm | Holistic evaluation framework | Link | 26.01 | MedHelm |
| hle | Academic knowledge and problem solving | Link | 26.01 | HLE |
| ifbench | Instruction following | Link | 26.01 | IFBench |
| livecodebench | Coding | Link | 26.01 | LiveCodeBench (v1-v6, 0724_0125, 0824_0225) |
| lm-evaluation-harness | Language model benchmarks | Link | 26.01 | ARC Challenge (also multilingual), GSM8K, HumanEval, HumanEval+, MBPP, MBPP+, MINERVA Math, RACE, AGIEval, BBH, BBQ, CSQA, Frames, Global MMLU, GPQA-D, HellaSwag (also multilingual), IFEval, MGSM, MMLU, MMLU-Pro, MMLU-ProX (de, es, fr, it, ja), MMLU-Redux, MUSR, OpenbookQA, Piqa, Social IQa, TruthfulQA, WikiLingua, WinoGrande |
| long-context-eval | Long context evaluation | Link | 26.01 | RULER |
| mmath | Multilingual math reasoning | Link | 26.01 | EN, ZH, AR, ES, FR, JA, KO, PT, TH, VI |
| mtbench | Multi-turn conversation evaluation | Link | 26.01 | MT-Bench |
| mteb | Multimodal toolbox for evaluating embeddings and retrieval systems | Link | 26.01 | fiqa, miracl, ViDoRe |
| nemo-skills | Language model benchmarks (science, math, agentic) | Link | 26.01 | AIME 24 & 25, BFCL_v3, GPQA, HLE, LiveCodeBench, MMLU, MMLU-Pro |
| profbench | Professional domains in Business and Scientific Research | Link | 26.01 | ProfBench |
| safety-harness | Safety and bias evaluation | Link | 26.01 | Aegis v2, WildGuard |
| scicode | Coding for scientific research | Link | 26.01 | SciCode |
| simple-evals | Common evaluation tasks | Link | 26.01 | GPQA-D, MATH-500, AIME 24 & 25, HumanEval, HumanEval+, MGSM, MMLU (also multilingual), MMLU-Pro, MMLU-lite (AR, BN, DE, EN, ES, FR, HI, ID, IT, JA, KO, MY, PT, SW, YO, ZH), SimpleQA, BrowseComp, HealthBench |
| tau2-bench | TAU2 benchmark evaluation | Link | 26.01 | TAU2-Bench telecom, airline, retail |
| tooltalk | Tool usage evaluation | Link | 26.01 | ToolTalk |
| vlmevalkit | Vision-language model evaluation | Link | 26.01 | AI2D, ChartQA, MMMU, MathVista-MINI, OCRBench, SlideVQA |
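If you prefer not to scan the table, the launcher can also enumerate the available tasks from the command line. A minimal sketch is below; the exact subcommand arguments are an assumption and may differ between versions, so check `nemo-evaluator-launcher --help` (note that `nel` is a shorthand alias, and `nel ls` may require Docker authentication, as described at the end of this README).

```bash
# List the evaluation tasks/benchmarks the launcher knows about.
# The exact subcommand ("ls tasks") is an assumption; consult --help for your installed version.
nemo-evaluator-launcher ls tasks

# Equivalent call using the shorthand alias
nel ls tasks
```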
Get your first evaluation result in minutes. This guide uses your local machine to run a small benchmark against an OpenAI API-compatible endpoint.
The launcher is the only package required to get started.
`pip install nemo-evaluator-launcher`

NeMo Evaluator works with any model that exposes an OpenAI-compatible endpoint. For this quickstart, we will use the OpenAI API.
What is an OpenAI-compatible endpoint? A server that exposes /v1/chat/completions and /v1/completions endpoints, matching the OpenAI API specification.
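For illustration, a minimal request to such an endpoint has the shape sketched below; the URL, API key, and model name are placeholders for whatever endpoint you point the launcher at.

```bash
# Placeholder URL, key, and model name: substitute those of your own endpoint.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "my-model",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```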
Options for model endpoints:
- Hosted endpoints (fastest): Use ready-to-use hosted models from providers like build.nvidia.com that expose OpenAI-compatible APIs with no hosting required.
- Self-hosted options: Host your own models using tools like NVIDIA NIM, vLLM, or TensorRT-LLM for full control over your evaluation environment.
- Models trained with the NeMo framework: Deploy models trained with the NeMo framework as OpenAI-compatible endpoints using NeMo Export-Deploy. A more detailed user guide is available here.
Getting an NGC API Key for build.nvidia.com:
To use out-of-the-box build.nvidia.com APIs, you need an API key:
- Register an account at build.nvidia.com.
- In the Setup menu under Keys/Secrets, generate an API key.
- Set the environment variable by executing `export NGC_API_KEY=<YOUR_API_KEY>`.
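As a quick sanity check that the key works, you can call one of the hosted models directly. The sketch below assumes the build.nvidia.com OpenAI-compatible base URL (https://integrate.api.nvidia.com/v1) and a Nemotron Nano model id matching the example configuration used later in this quickstart; substitute any model id listed on build.nvidia.com.

```bash
# Assumed base URL and model id; pick any model listed on build.nvidia.com.
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/nvidia-nemotron-nano-9b-v2",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 16
      }'
```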
Run a small evaluation on your local machine. The launcher automatically pulls the correct container and executes the benchmark. The list of benchmarks is directly configured in the YAML file.
Configuration Examples: Explore ready-to-use configuration files in packages/nemo-evaluator-launcher/examples/ for local, Lepton, and Slurm deployments with various model hosting options (vLLM, NIM, hosted endpoints).
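For orientation only, a local-executor configuration might look roughly like the sketch below. The field names are assumptions modeled on the shipped examples and may differ between versions (only execution.output_dir is confirmed by the run command further down), so copy an actual file from packages/nemo-evaluator-launcher/examples/ rather than this sketch.

```yaml
# Hypothetical sketch of a launcher config; field names are assumptions.
# Copy a real file from packages/nemo-evaluator-launcher/examples/ instead.
execution:
  output_dir: results          # overridable via -o execution.output_dir=...

target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions   # any OpenAI-compatible endpoint
    model_id: nvidia/nvidia-nemotron-nano-9b-v2                 # model served at that endpoint
    api_key_name: NGC_API_KEY                                   # env var holding the API key

evaluation:
  tasks:
    - name: ifeval                                              # task names are illustrative
    - name: gpqa_diamond
```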
Once you have the example configuration file, either by cloning this repository or downloading one directly such as local_nvidia_nemotron_nano_9b_v2.yaml, you can run the following command:
`nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_nvidia_nemotron_nano_9b_v2.yaml -o execution.output_dir=<YOUR_OUTPUT_LOCAL_DIR>`

After running this command, you will see a job_id, which can be used to track the job and its results. All logs will be available in your <YOUR_OUTPUT_LOCAL_DIR>.
Results, logs, and run configurations are saved locally. Inspect the status of the evaluation job by using the corresponding job_id:
`nemo-evaluator-launcher status <job_id_or_invocation_id>`

We welcome community contributions. Please see our Contribution Guide for instructions on submitting pull requests, reporting issues, and suggesting features.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: NeMo Evaluator Documentation
- nel ls might require Docker authentication and currently does not support fetching credentials from password management systems such as macOS's Keychain or GNOME Keyring.
Alternative AI tools for Evaluator
Similar Open Source Tools
For similar tasks
ai4math-papers
The 'ai4math-papers' repository contains a collection of research papers related to AI applications in mathematics, including automated theorem proving, synthetic theorem generation, autoformalization, proof refactoring, premise selection, benchmarks, human-in-the-loop interactions, and constructing examples/counterexamples. The papers cover various topics such as neural theorem proving, reinforcement learning for theorem proving, generative language modeling, formal mathematics statement curriculum learning, and more. The repository serves as a valuable resource for researchers and practitioners interested in the intersection of AI and mathematics.
AirspeedVelocity.jl
AirspeedVelocity.jl is a tool designed to simplify benchmarking of Julia packages over their lifetime. It provides a CLI to generate benchmarks, compare commits/tags/branches, plot benchmarks, and run benchmark comparisons for every submitted PR as a GitHub action. The tool freezes the benchmark script at a specific revision to prevent old history from affecting benchmarks. Users can configure options using CLI flags and visualize benchmark results. AirspeedVelocity.jl can be used to benchmark any Julia package and offers features like generating tables and plots of benchmark results. It also supports custom benchmarks and can be integrated into GitHub actions for automated benchmarking of PRs.
hallucination-leaderboard
This leaderboard evaluates the hallucination rate of various Large Language Models (LLMs) when summarizing documents. It uses a model trained by Vectara to detect hallucinations in LLM outputs. The leaderboard includes models from OpenAI, Anthropic, Google, Microsoft, Amazon, and others. The evaluation is based on 831 documents that were summarized by all the models. The leaderboard shows the hallucination rate, factual consistency rate, answer rate, and average summary length for each model.
h2o-llmstudio
H2O LLM Studio is a framework and no-code GUI designed for fine-tuning state-of-the-art large language models (LLMs). With H2O LLM Studio, you can easily and effectively fine-tune LLMs without the need for any coding experience. The GUI is specially designed for large language models, and you can finetune any LLM using a large variety of hyperparameters. You can also use recent finetuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint. Additionally, you can use Reinforcement Learning (RL) to finetune your model (experimental), use advanced evaluation metrics to judge generated answers by the model, track and compare your model performance visually, and easily export your model to the Hugging Face Hub and share it with the community.
llm-jp-eval
LLM-jp-eval is a tool designed to automatically evaluate Japanese large language models across multiple datasets. It provides functionalities such as converting existing Japanese evaluation data to text generation task evaluation datasets, executing evaluations of large language models across multiple datasets, and generating instruction data (jaster) in the format of evaluation data prompts. Users can manage the evaluation settings through a config file and use Hydra to load them. The tool supports saving evaluation results and logs using wandb. Users can add new evaluation datasets by following specific steps and guidelines provided in the tool's documentation. It is important to note that using jaster for instruction tuning can lead to artificially high evaluation scores, so caution is advised when interpreting the results.
Awesome-LLM
Awesome-LLM is a curated list of resources related to large language models, focusing on papers, projects, frameworks, tools, tutorials, courses, opinions, and other useful resources in the field. It covers trending LLM projects, milestone papers, other papers, open LLM projects, LLM training frameworks, LLM evaluation frameworks, tools for deploying LLM, prompting libraries & tools, tutorials, courses, books, and opinions. The repository provides a comprehensive overview of the latest advancements and resources in the field of large language models.
bocoel
BoCoEL is a tool that leverages Bayesian Optimization to efficiently evaluate large language models by selecting a subset of the corpus for evaluation. It encodes individual entries into embeddings, uses Bayesian optimization to select queries, retrieves from the corpus, and provides easily managed evaluations. The tool aims to reduce computation costs during evaluation with a dynamic budget, supporting models like GPT2, Pythia, and LLAMA through integration with Hugging Face transformers and datasets. BoCoEL offers a modular design and efficient representation of the corpus to enhance evaluation quality.
For similar jobs
llm-jp-eval
LLM-jp-eval is a tool designed to automatically evaluate Japanese large language models across multiple datasets. It provides functionalities such as converting existing Japanese evaluation data to text generation task evaluation datasets, executing evaluations of large language models across multiple datasets, and generating instruction data (jaster) in the format of evaluation data prompts. Users can manage the evaluation settings through a config file and use Hydra to load them. The tool supports saving evaluation results and logs using wandb. Users can add new evaluation datasets by following specific steps and guidelines provided in the tool's documentation. It is important to note that using jaster for instruction tuning can lead to artificially high evaluation scores, so caution is advised when interpreting the results.
AlignBench
AlignBench is the first comprehensive evaluation benchmark for assessing the alignment level of Chinese large models across multiple dimensions. It includes introduction information, data, and code related to AlignBench. The benchmark aims to evaluate the alignment performance of Chinese large language models through a multi-dimensional and rule-calibrated evaluation method, enhancing reliability and interpretability.
LiveBench
LiveBench is a benchmark tool designed for Large Language Models (LLMs) with a focus on limiting contamination through monthly new questions based on recent datasets, arXiv papers, news articles, and IMDb movie synopses. It provides verifiable, objective ground-truth answers for accurate scoring without an LLM judge. The tool offers 18 diverse tasks across 6 categories and promises to release more challenging tasks over time. LiveBench is built on FastChat's llm_judge module and incorporates code from LiveCodeBench and IFEval.
evalchemy
Evalchemy is a unified and easy-to-use toolkit for evaluating language models, focusing on post-trained models. It integrates multiple existing benchmarks such as RepoBench, AlpacaEval, and ZeroEval. Key features include unified installation, parallel evaluation, simplified usage, and results management. Users can run various benchmarks with a consistent command-line interface and track results locally or integrate with a database for systematic tracking and leaderboard submission.
home-assistant-datasets
This package provides a collection of datasets for evaluating AI Models in the context of Home Assistant. It includes synthetic data generation, loading data into Home Assistant, model evaluation with different conversation agents, human annotation of results, and visualization of improvements over time. The datasets cover home descriptions, area descriptions, device descriptions, and summaries that can be performed on a home. The tool aims to build datasets for future training purposes.
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
agentcloud
AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.