inspect_evals

Collection of evals for Inspect AI

Stars: 236

Inspect Evals is a repository of community-contributed LLM evaluations for Inspect AI, created collaboratively by the UK AISI, Arcadia Impact, and the Vector Institute. It supports many model providers, including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, and Ollama. Users can contribute evaluations, install the necessary dependencies, and run evaluations against various models. The repository covers a wide range of evaluation tasks across domains such as coding, assistants, cybersecurity, safeguards, mathematics, reasoning, knowledge, scheming, multimodal tasks, bias evaluation, personality assessment, and writing.

README:

UK AISI Logo

Welcome to Inspect Evals, a repository of community-contributed LLM evaluations for Inspect AI. Inspect Evals was created collaboratively by the UK AISI, Arcadia Impact, and the Vector Institute.

📚 Documentation


Community contributions are welcome and encouraged! Please see the Contributor Guide for details on submitting new evaluations.

Getting Started

First, install the necessary dependencies with:

uv sync

Some of the evaluations require additional dependencies. If your eval needs extra dependencies, instructions for installing them are provided in the list of evals below (or in the README for that evaluation). For example, to install the dependencies of the SWE-Bench evaluation, run:

uv sync --extra swe_bench
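
If you plan to run evaluations from several different groups, uv can also install every optional dependency group in one step. This is a convenience sketch rather than a documented workflow for this repository, and installing all extras takes noticeably more time and disk space:

uv sync --all-extras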

Once you have a model configured, you can run evaluations for it with:

uv run inspect eval inspect_evals/gpqa_diamond --model openai/gpt-4o
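
The same pattern works for any supported provider, and Inspect's standard CLI options apply; for example, you can limit the number of samples for a quick smoke test. The model name below is illustrative, so substitute one you have access to:

uv run inspect eval inspect_evals/gpqa_diamond --model anthropic/claude-3-5-sonnet-latest --limit 10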

If you don't want to specify the --model each time you run an evaluation, create a .env configuration file in your working directory that defines the INSPECT_EVAL_MODEL environment variable along with your API key. For example:

INSPECT_EVAL_MODEL=openai/gpt-4o
OPENAI_API_KEY=<openai-api-key>

Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, Ollama, and more. See the Model Providers documentation for additional details.
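
The same .env approach works for other providers; swap in the provider's model name and API key variable. For example, a hypothetical Anthropic configuration might look like:

INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-latest
ANTHROPIC_API_KEY=<anthropic-api-key>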

You might also be able to use a newer version of pip (25.1+) to install the project via pip install --group dev . or pip install --group dev '.[swe_bench]'. However, this is not officially supported.

Documentation

For details on building the documentation, see the documentation guide.

For information on running tests and CI toggles, see the Technical Contribution Guide in CONTRIBUTING.md.

Hardware recommendations

Disk

We recommend having at least 35 GB of free disk space for Inspect Evals: the full installation takes about 10 GB, and you'll also need space for the uv cache and dataset caches (most datasets are small, but some, such as MMIU, take around 13 GB).

Running some evals (e.g., CyBench and the GDM capabilities evals) may require extra space beyond this because they pull Docker images. For evals that have Dockerfiles in their file tree, we recommend at least 65 GB on top of the 35 GB above, though you might get away with less.
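
If Docker image layers start eating into that budget, standard Docker housekeeping commands (not specific to Inspect Evals) can help: docker system df reports how much space images and containers use, and docker system prune (with --all to also remove unused images) reclaims it after a confirmation prompt.

docker system df
docker system prune --all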

In total, you should be comfortable running evals with 100 GB of free space. If you run out of space despite having 100+ GB free, please let us know, as this might be a bug.

RAM

The amount of memory needed varies significantly from eval to eval. Most evals run with only 0.5 GB of free RAM, but some with larger datasets require 2-3 GB or more, and some evals that use Docker (e.g., some GDM capabilities evals) require up to 32 GB of RAM.

List of Evals

Coding

Assistants

  • Tests whether AI agents can perform real-world time-consuming tasks on the web. Contributed by: @nlpet, @caspardh

    uv run inspect eval inspect_evals/assistant_bench_closed_book_zero_shot
    uv run inspect eval inspect_evals/assistant_bench_closed_book_one_shot
    uv run inspect eval inspect_evals/assistant_bench_web_search_zero_shot
    uv run inspect eval inspect_evals/assistant_bench_web_search_one_shot
    uv run inspect eval inspect_evals/assistant_bench_web_browser
  • A benchmark for evaluating agents' ability to browse the web. The dataset consists of challenging questions that generally require web-access to answer correctly. Contributed by: @AnselmC

    uv run inspect eval inspect_evals/browse_comp
  • Proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs. Contributed by: @max-kaufmann

    uv run inspect eval inspect_evals/gaia
    uv run inspect eval inspect_evals/gaia_level1
    uv run inspect eval inspect_evals/gaia_level2
    uv run inspect eval inspect_evals/gaia_level3
  • A dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Contributed by: @dr3s

    uv run inspect eval inspect_evals/mind2web
  • Tests AI agents' ability to perform realistic, open-ended tasks within simulated computer environments, requiring complex interaction across multiple input modalities. Contributed by: @epatey

    uv run inspect eval inspect_evals/osworld
    uv run inspect eval inspect_evals/osworld_small
  • Evaluate sycophancy of language models across a variety of free-form text-generation tasks. Contributed by: @alexdzm

    uv run inspect eval inspect_evals/sycophancy

Cybersecurity

Safeguards

Mathematics

Reasoning

  • Dataset of natural, grade-school science multiple-choice questions (authored for human tests). Contributed by: @jjallaire

    uv run inspect eval inspect_evals/arc_easy
    uv run inspect eval inspect_evals/arc_challenge
  • Tests AI models on a suite of 23 challenging BIG-Bench tasks that previously proved difficult even for advanced language models to solve. Contributed by: @JoschkaCBraun

    uv run inspect eval inspect_evals/bbh
  • A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. Contributed by: @jeqcho

    uv run inspect eval inspect_evals/bbeh
    uv run inspect eval inspect_evals/bbeh_mini
  • Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve. Contributed by: @seddy-aisi

    uv run inspect eval inspect_evals/boolq
  • Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). Contributed by: @xeon27

    uv run inspect eval inspect_evals/drop
  • Tests models' commonsense reasoning abilities by asking them to select the most likely next step or continuation for a given everyday situation. Contributed by: @jjallaire

    uv run inspect eval inspect_evals/hellaswag
  • Evaluates how well language models can strictly follow detailed instructions, such as writing responses with specific word counts or including required keywords. Contributed by: @adil-a

    uv run inspect eval inspect_evals/ifeval
  • Two linguistics reasoning benchmarks: LingOly (Linguistic Olympiad questions) is a benchmark utilising low resource languages. LingOly-TOO (Linguistic Olympiad questions with Templatised Orthographic Obfuscation) is a benchmark designed to counteract answering without reasoning. Contributed by: @am-bean, @jkhouja

    uv run inspect eval inspect_evals/lingoly
    uv run inspect eval inspect_evals/lingoly_too
  • Assesses multimodal AI models on challenging college-level questions covering multiple academic subjects, requiring detailed visual interpretation, in-depth reasoning, and both multiple-choice and open-ended answering abilities. Contributed by: @shaheenahmedc

    uv run inspect eval inspect_evals/mmmu_multiple_choice
    uv run inspect eval inspect_evals/mmmu_open
  • Evaluating models on multistep soft reasoning tasks in the form of free text narratives. Contributed by: @farrelmahaztra

    uv run inspect eval inspect_evals/musr
  • NIAH (needle in a haystack) evaluates the in-context retrieval ability of long-context LLMs by testing whether a model can extract factual information embedded in long inputs. Contributed by: @owenparsons

    uv run inspect eval inspect_evals/niah
  • Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not. Contributed by: @meltemkenis

    uv run inspect eval inspect_evals/paws
  • Measures the model's ability to apply practical, everyday commonsense reasoning about physical objects and scenarios through simple decision-making questions. Contributed by: @seddy-aisi

    uv run inspect eval inspect_evals/piqa
  • Reading comprehension tasks collected from the English exams for Chinese middle and high school students aged 12 to 18. Contributed by: @mdrpanwar

    uv run inspect eval inspect_evals/race_h
  • Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. Contributed by: @tknasir

    uv run inspect eval inspect_evals/squad
  • Set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. Contributed by: @xeon27

    uv run inspect eval inspect_evals/winogrande
  • LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese. Contributed by: @celiawaggoner

    uv run inspect eval inspect_evals/infinite_bench_code_debug
    uv run inspect eval inspect_evals/infinite_bench_code_run
    uv run inspect eval inspect_evals/infinite_bench_kv_retrieval
    uv run inspect eval inspect_evals/infinite_bench_longbook_choice_eng
    uv run inspect eval inspect_evals/infinite_bench_longdialogue_qa_eng
    uv run inspect eval inspect_evals/infinite_bench_math_calc
    uv run inspect eval inspect_evals/infinite_bench_math_find
    uv run inspect eval inspect_evals/infinite_bench_number_string
    uv run inspect eval inspect_evals/infinite_bench_passkey

Knowledge

  • AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. Contributed by: @bouromain

    uv run inspect eval inspect_evals/agie_aqua_rat
    uv run inspect eval inspect_evals/agie_logiqa_en
    uv run inspect eval inspect_evals/agie_lsat_ar
    uv run inspect eval inspect_evals/agie_lsat_lr
    uv run inspect eval inspect_evals/agie_lsat_rc
    uv run inspect eval inspect_evals/agie_math
    uv run inspect eval inspect_evals/agie_sat_en
    uv run inspect eval inspect_evals/agie_sat_en_without_passage
    uv run inspect eval inspect_evals/agie_sat_math
  • A safety benchmark evaluating language models against risk categories derived from government regulations and company policies. Contributed by: @l1990790120

    uv run inspect eval inspect_evals/air_bench
  • ChemBench is designed to reveal limitations of current frontier models for use in the chemical sciences. It consists of 2786 question-answer pairs compiled from diverse sources, measuring reasoning, knowledge, and intuition across a large fraction of the topics taught in undergraduate and graduate chemistry curricula. It can be used to evaluate any system that can return text, including tool-augmented systems. Contributed by: @Esther-Guo

    uv run inspect eval inspect_evals/chembench
  • Evaluates an AI model's ability to correctly answer everyday questions that rely on basic commonsense knowledge and understanding of the world. Contributed by: @jjallaire

    uv run inspect eval inspect_evals/commonsense_qa
  • Contains challenging multiple-choice questions created by domain experts in biology, physics, and chemistry, designed to test advanced scientific understanding beyond basic internet searches. Experts at PhD level in the corresponding domains reach 65% accuracy. Contributed by: @jjallaire

    uv run inspect eval inspect_evals/gpqa_diamond
  • A comprehensive evaluation benchmark designed to assess language models' medical capabilities across a wide range of healthcare scenarios. Contributed by: @retroam

    uv run inspect eval inspect_evals/healthbench
    uv run inspect eval inspect_evals/healthbench_hard
    uv run inspect eval inspect_evals/healthbench_consensus
  • Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. Humanity's Last Exam consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Contributed by: @SasankYadati

    uv run inspect eval inspect_evals/hle
  • LiveBench is a benchmark designed with test set contamination and objective evaluation in mind by releasing new questions regularly, as well as having questions based on recently-released datasets. Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge. Contributed by: @anaoaktree

    uv run inspect eval inspect_evals/livebench
  • An advanced benchmark that tests both broad knowledge and reasoning capabilities across many subjects, featuring challenging questions and multiple-choice answers with increased difficulty and complexity. Contributed by: @xeon27

    uv run inspect eval inspect_evals/mmlu_pro
  • Evaluate models on 57 tasks including elementary mathematics, US history, computer science, law, and more. Contributed by: @jjallaire, @domdomegg

    uv run inspect eval inspect_evals/mmlu_0_shot
    uv run inspect eval inspect_evals/mmlu_5_shot
  • A Q&A benchmark with questions collected from professional medical board exams. Only includes the English subset of the dataset (which also contains Mandarin Chinese and Taiwanese questions). Contributed by: @bunny-baxter, @JasonBenn

    uv run inspect eval inspect_evals/medqa
  • Questions and answers from the Ordinary National Educational Test (O-NET), administered annually by the National Institute of Educational Testing Service to Matthayom 6 (Grade 12 / ISCED 3) students in Thailand. The exam covers English language, math, science, social knowledge, and Thai language, with multiple-choice and true/false questions. Questions can be in either English or Thai. Contributed by: @bact

    uv run inspect eval inspect_evals/onet_m6
  • Tests model understanding of aviation regulations including ICAO annexes, flight dispatch rules, pilot procedures, and airport ground operations safety protocols. Contributed by: @alexbrooker

    uv run inspect eval inspect_evals/pre_flight
  • Biomedical question answering (QA) dataset collected from PubMed abstracts. Contributed by: @MattFisher

    uv run inspect eval inspect_evals/pubmedqa
  • A regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). Contributed by: @Esther-Guo

    uv run inspect eval inspect_evals/sosbench
  • The Scientific Knowledge Evaluation benchmark is inspired by the profound principles outlined in the “Doctrine of the Mean” from ancient Chinese philosophy. This benchmark is designed to assess LLMs based on their proficiency in Studying Extensively, Enquiring Earnestly, Thinking Profoundly, Discerning Clearly, and Practicing Assiduously. Each of these dimensions offers a unique perspective on evaluating the capabilities of LLMs in handling scientific knowledge. Contributed by: @Esther-Guo

    uv run inspect eval inspect_evals/sciknoweval
  • A benchmark that evaluates the ability of language models to answer short, fact-seeking questions. Contributed by: @osc245, @jeqcho

    uv run inspect eval inspect_evals/simpleqa
    uv run inspect eval inspect_evals/simpleqa_verified
  • Measure whether a language model is truthful in generating answers to questions using questions that some humans would answer falsely due to a false belief or misconception. Contributed by: @seddy-aisi

    uv run inspect eval inspect_evals/truthfulqa
  • Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. Contributed by: @NelsonG-C

    uv run inspect eval inspect_evals/xstest

Scheming

  • Eliciting unethical behaviour (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement. Contributed by: @bmillwood-aisi

    uv run inspect eval inspect_evals/agentic_misalignment
  • Test AI's ability to reason about its environment. Contributed by: @ZiyueWang25, @XkunW

    uv run inspect eval inspect_evals/gdm_approved_directories
    uv run inspect eval inspect_evals/gdm_calculator_improvement
    uv run inspect eval inspect_evals/gdm_context_length_mod_instrumental_only
    uv run inspect eval inspect_evals/gdm_context_length_mod_irreversibility_only
    uv run inspect eval inspect_evals/gdm_database_tool
    uv run inspect eval inspect_evals/gdm_latency_calculator
    uv run inspect eval inspect_evals/gdm_max_messages_calculator
    uv run inspect eval inspect_evals/gdm_max_tokens
    uv run inspect eval inspect_evals/gdm_oversight_frequency
    uv run inspect eval inspect_evals/gdm_read_logs
    uv run inspect eval inspect_evals/gdm_turn_off_filters
  • Test AI's ability to reason about and circumvent oversight. Contributed by: @ZiyueWang25

    uv run inspect eval inspect_evals/gdm_classifier_evasion
    uv run inspect eval inspect_evals/gdm_cover_your_tracks
    uv run inspect eval inspect_evals/gdm_oversight_pattern
    uv run inspect eval inspect_evals/gdm_strategic_rule_breaking

Multimodal

  • DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split. Contributed by: @evanmiller-anthropic

    uv run inspect eval inspect_evals/docvqa
  • A comprehensive dataset designed to evaluate Large Vision-Language Models (LVLMs) across a wide range of multi-image tasks. The dataset encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions. Contributed by: @Esther-Guo

    uv run inspect eval inspect_evals/mmiu
  • V*Bench is a visual question & answer benchmark that evaluates MLLMs in their ability to process high-resolution and visually crowded images to find and focus on small details. Contributed by: @bienehito

    uv run inspect eval inspect_evals/vstar_bench_attribute_recognition
    uv run inspect eval inspect_evals/vstar_bench_spatial_relationship_reasoning
  • A lightweight visual reasoning benchmark that is (1) Challenging, (2) Lightweight, (3) Diverse, and (4) High-quality. Contributed by: @ItsTania

    uv run inspect eval inspect_evals/zerobench
    uv run inspect eval inspect_evals/zerobench_subquestions

Bias

Personality

  • An evaluation suite consisting of multiple personality tests that can be applied to LLMs. Its primary goals are twofold:

    1. Assess a model's default personality: the persona it naturally exhibits without specific prompting.
    2. Evaluate whether a model can embody a specified persona: how effectively it adopts certain personality traits when prompted or guided. Contributed by: @guiem

    uv run inspect eval inspect_evals/personality_BFI
    uv run inspect eval inspect_evals/personality_TRAIT
    uv run inspect eval inspect_evals/personality_PRIME

Writing

  • A comprehensive evaluation benchmark designed to assess large language models' capabilities across diverse writing tasks. The benchmark evaluates models on various writing domains including academic papers, business documents, creative writing, and technical documentation, with multi-dimensional scoring based on domain-specific criteria. Contributed by: @jtv199

    uv run inspect eval inspect_evals/writingbench
