LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
The LLMEvaluation repository is a comprehensive compendium of evaluation methods for Large Language Models (LLMs) and LLM-based systems. It aims to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs by reviewing industry practices for assessing LLMs and their applications. The repository covers a wide range of evaluation techniques, benchmarks, and studies related to LLMs, including areas such as embeddings, question answering, multi-turn dialogues, reasoning, multi-lingual tasks, ethical AI, biases, safe AI, code generation, summarization, software performance, agent LLM architectures, long text generation, graph understanding, and various unclassified tasks. It also includes evaluations for LLM systems in conversational systems, copilots, search and recommendation engines, task utility, and verticals like healthcare, law, science, financial, and others. The repository provides a wealth of resources for evaluating and understanding the capabilities of LLMs in different domains.
README:
The aim of this compendium is to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs. It does so by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it encompasses a comprehensive overview of all effective and practical evaluation techniques, including those embedded within papers that primarily introduce new LLM methodologies and tasks. I plan to periodically update this survey with any noteworthy and shareable evaluation methods that I come across. I aim to create a resource that will enable anyone with queries—whether it's about evaluating a large language model (LLM) or an LLM application for specific tasks, determining the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain—to easily find all the relevant information needed for these tasks. Additionally, I want to highlight various methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align effectively with business or academic objectives.
My view on LLM Evaluation: Deck, SF Big Analytics and AICamp video, Analytics Vidhya (Data Phoenix, Mar 5) (by Andrei Lopatenko). Main repository
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Large benchmarks
- Evaluation of evaluation, evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Software Performance (latency, throughput, memory, storage)
- Agent LLM architectures
- Long Text Generation
- Graph Understanding
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
- New Hard Leaderboard by HuggingFace leaderboard description, blog post
- LMSys Arena (explanation)
- Salesforce's Contextual Bench leaderboard, Hugging Face: an overview of how different LLMs perform across a variety of contextual tasks
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv code
- Open Medical LLM Leaderboard from HF Explanation
- Gorilla, Berkeley function calling Leaderboard Explanation
- WildBench WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- Hugging Face LLM Performance hugging face leaderboard
- Multi-task Language Understanding on MMLU
- EleutherAI LLM Evaluation Harness
- OpenAI Evals
- ConfidentAI DeepEval
- MTEB
- OpenICL Framework
- RAGAS
- ML Flow Evaluate
- MosaicML Composer
- Toolkit from Mozilla AI for LLM-as-judge evaluation: lm-buddy (eval tool), Prometheus (model)
- TruLens
- Promptfoo
- BigCode Evaluation Harness
- LangFuse
- LLMeBench see LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- ChainForge
- Ironclad Rivet
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models, arxiv pdf github repository
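
As a concrete example of using one of the frameworks listed above, below is a minimal sketch of running two benchmark tasks with the EleutherAI LM Evaluation Harness. It assumes `pip install lm-eval` (the v0.4.x Python API) and an illustrative Hugging Face model id; the exact function signature and task names may differ between harness versions.

```python
# Minimal sketch: running benchmarks with the EleutherAI LM Evaluation Harness.
# Assumption: lm-eval v0.4.x API; the model id and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any HF causal LM id
    tasks=["hellaswag", "arc_easy"],                     # tasks covered in this compendium
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics such as acc / acc_norm
```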
---
- Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM, Sep 2024, blog
- Let's talk about LLM Evaluation, HuggingFace, article
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
- Catch me if you can! How to beat GPT-4 with a 13B model, LM sys org
- Why it's impossible to review AIs, and why TechCrunch is doing it anyway, TechCrunch, Mar 2024
- A.I. has a measurement problem, NY Times, Apr 2024
- Beyond Accuracy: The Changing Landscape Of AI Evaluation, Forbes, Mar 2024
- Mozilla AI Exploring LLM Evaluation at scale
- Evaluation part of How to Maximize LLM Performance
- Mozilla AI published multiple good articles on evaluation in the Mozilla AI blog
- Andrej Karpathy on evaluation X
- From Meta on evaluation of Llama 3 models github
- DeepMind AI Safety evaluation, Jun 2024, DeepMind blog, Introducing the Frontier Safety Framework
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- Hamel Dev March 2024, Your AI Product Needs Eval. How to construct domain-specific LLM evaluation systems
- MMLU Pro: Massive Multitask Language Understanding - Pro version, Jun 2024, arxiv
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, EMNLP 2022, pdf
- Measuring Massive Multitask Language Understanding, MMLU, ICLR 2021, arxiv, MMLU dataset
- BigBench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arxiv, datasets
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Oct 2022, arxiv
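
Benchmarks like MMLU, HellaSwag, and ARC are usually scored not by parsing free-form generations but by comparing the model's likelihood of each answer option. The sketch below illustrates that idea from scratch with a small Hugging Face model; the model id and prompt format are illustrative assumptions, not the official scoring code of any benchmark above.

```python
# Sketch of log-likelihood multiple-choice scoring, the common way MMLU / HellaSwag /
# ARC-style benchmarks are evaluated: pick the option with the highest
# (length-normalized) log-probability given the question. Model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(context: str, choice: str) -> float:
    """Mean log-prob of the choice tokens given the context.
    Assumes tokenizing the context alone gives a prefix of tokenizing context+choice
    (true for typical BPE tokenizers when the choice starts with a space)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..L-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    choice_lp = token_lp[:, ctx_ids.shape[1] - 1:]          # keep only the choice tokens
    return choice_lp.mean().item()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " Berlin", " Madrid", " Rome"]
pred = max(range(len(choices)), key=lambda i: choice_logprob(question, choices[i]))
print("predicted:", choices[pred].strip())
```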
- Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks, May 2024, ICML 2024, arxiv
- Lessons from the Trenches on Reproducible Evaluation of Language Models, May 2024, arxiv
- Synthetic data in evaluation, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, arxiv
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 arxiv
- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, arxiv
- Are Emergent Abilities of Large Language Models a Mirage?, Apr 2023, arxiv
- Don't Make Your LLM an Evaluation Benchmark Cheater, Nov 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Evaluating Open-QA Evaluation, 2023, arxiv
- (re: statistical methods) Prediction-Powered Inference, Jan 2023, arxiv; PPI++: Efficient Prediction-Powered Inference, Nov 2023, arxiv
- Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress, Feb 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Detecting Pretraining Data from Large Language Models, Oct 2023, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- Faithful model evaluation for model-based metrics, EMNLP 2023, amazon science
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, ICML 2023, mlr press
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- State of What Art? A Call for Multi-Prompt LLM Evaluation, Aug 2024, Transactions of the Association for Computational Linguistics (2024) 12
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop, Feb 2024, arxiv
- Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation, Nov 2023, arxiv
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, Apr 2024, arxiv
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries, Sep 2024, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Jun 2023, arxiv
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Jun 2024, arxiv
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv leaderboard code
- Discovering Language Model Behaviors with Model-Written Evaluations, Dec 2022, arxiv
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate, Feb 2024, arxiv
- Benchmarking Foundation Models with Language-Model-as-an-Examiner, 2022, NEURIPS
- Red Teaming Language Models with Language Models, Feb 2022, arxiv
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, Aug 2023, arxiv
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning, Sep 2023, arxiv
- Style Over Substance: Evaluation Biases for Large Language Models, Jul 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality, Feb 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Code Generation, Apr 2023, researchgate
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators, Mar 2024, arxiv
- LLM Evaluators Recognize and Favor Their Own Generations, Apr 2024, pdf
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences, Apr 2024, arxiv
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
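
A minimal sketch of the pairwise LLM-as-judge pattern discussed in the papers above (MT-Bench, AlpacaEval, judge panels). It uses the OpenAI Python client (v1+, with OPENAI_API_KEY set); the judge model name and prompt wording are illustrative assumptions, not a reference implementation. Judging each pair twice with the answers swapped is one common mitigation for the position bias documented in the papers above.

```python
# Pairwise LLM-as-judge sketch. Assumptions: OpenAI Python client >= 1.0,
# OPENAI_API_KEY in the environment, illustrative judge model and prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two answers,
reply with exactly "A", "B", or "tie" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

answer_x = "Overfitting is when a model memorizes training data instead of generalizing."
answer_y = "Overfitting means the model fits noise, so it performs worse on new data."
# Judge in both orders and only count a win if the two verdicts agree.
print(judge_pair("Explain overfitting.", answer_x, answer_y))
print(judge_pair("Explain overfitting.", answer_y, answer_x))
```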
- MTEB: Massive Text Embedding Benchmark, Oct 2022, [arxiv](https://arxiv.org/abs/2210.07316), Leaderboard
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- MMTEB: Community driven extension to MTEB repository
- Chinese MTEB C-MTEB repository
- French MTEB repository
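
A short sketch of how an embedding model is typically evaluated on MTEB-style benchmarks such as those listed above. It assumes the `mteb` and `sentence-transformers` packages; the task names and model id are illustrative, and the MTEB API may differ slightly between versions.

```python
# Sketch: evaluating an embedding model on a couple of MTEB tasks.
# Assumptions: `pip install mteb sentence-transformers`; task names, model id,
# and the exact MTEB constructor signature may vary with the installed version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```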
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- The LAMBADA dataset: Word prediction requiring a broad discourse context 2016, arxiv
- INVITE: A testbed of automatically generated invalid questions to evaluate large language models for hallucinations, EMNLP 2023, amazon science
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, Dec 2023, ACL
- Long-form factuality in large language models, Mar 2024, arxiv
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Sep 2023, arxiv
- Measuring Faithfulness in Chain-of-Thought Reasoning, Jul 2023, arxiv
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, May 2023, arxiv repository
QA is used in many vertical domains; see the Verticals section below.
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, Jun 2019, ACL
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Sep 2018, arxiv OpenBookQA dataset at AllenAI
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (Jin, Di, et al.), 2020, arxiv, MedQA
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, arxiv ARC Easy dataset ARC dataset
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, arxiv BoolQ dataset
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- PIQA: Reasoning about Physical Commonsense in Natural Language, Nov 2019, arxiv PIQA dataset
- Crowdsourcing Multiple Choice Science Questions arxiv SciQ dataset
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019, arxiv, Winogrande dataset
- TruthfulQA: Measuring How Models Mimic Human Falsehoods, Sep 2021, arxiv
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, arxiv data
- Natural Questions: A Benchmark for Question Answering Research, Transactions ACL 2019
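
For extractive and short-answer QA benchmarks like those above, results are usually reported as exact match and token-level F1 over normalized answers. Below is a from-scratch sketch of those two metrics using SQuAD-style normalization; official benchmark scripts may differ in details.

```python
# Exact match and token-level F1 for short-answer QA (SQuAD-style normalization).
# Generic sketch, not the official scoring script of any benchmark above.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1("in Paris, France", "Paris"), 2))        # 0.5
```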
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models Nov 2023, arxiv
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues, Feb 2024, arxiv
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, Sep 2023, arxiv
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks 2023, arxiv
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, arxiv
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 2024, arxiv
- Competition-Level Problems are Effective LLM Evaluators, Dec 2023, arxiv
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Oct 2023, arxiv
- AlGhafa Evaluation Benchmark for Arabic Language Models, Dec 2023, ACL Anthology, pdf article
- Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem HF blog
- Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese , Apr 2024 arxiv
- AlignBench: Benchmarking Chinese Alignment of Large Language Models, Nov 2023, arxiv
- The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian, Mar 2024, arxiv
- MEGA: Multilingual Evaluation of Generative AI, Mar 2023, arxiv
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, 2023, NIPS website
- LAraBench: Benchmarking Arabic AI with Large Language Models, May 2023, arxiv
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?, Apr 2024, arxiv
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- Chinese MTEB C-MTEB repository
- French MTEB repository
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, May 2023, arxiv
- Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models, Reka AI, May 2024, arxiv, dataset, blog post
- Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis, Aug 2024, arxiv
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models, Jun 2024, arxiv
- EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models, Jun 2024, arxiv
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models, Jun 2024, arxiv
- Holistic Evaluation of Text-to-Image Models, Nov 2023, arxiv
- VBench: Comprehensive Benchmark Suite for Video Generative Models, Nov 2023, arxiv
- Evaluating Text-to-Visual Generation with Image-to-Text Generation, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning, Nov 2023, arxiv
- BLINK: Multimodal Large Language Models Can See but Not Perceive, Apr 2024, arxiv, github
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models, Apr 2024, arxiv
- Evaluation part of https://arxiv.org/abs/2404.18930, Apr 2024, arxiv, repository
- Evaluating Large Language Models at Evaluating Instruction Following, Oct 2023, arxiv
- Instruction-Following Evaluation for Large Language Models, IFEval, Nov 2023, arxiv
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv, FLASK dataset
- DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, AAAI, pdf
- LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv, dataset
- Evaluating the Moral Beliefs Encoded in LLMs, Jul 2023, arxiv
- AI Deception: A Survey of Examples, Risks, and Potential Solutions, Aug 2023, arxiv
- Aligning AI With Shared Human Values, Aug 2020 - Feb 2023, arxiv (re: the ETHICS benchmark)
- What are human values, and how do we align AI to them?, Mar 2024, pdf
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; now included in BigBench, bigbench
- WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, Apr 2024, arxiv
- Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google
- BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv
- FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024, arxiv
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset
- “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAcct 2023, amazon science
- This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
- Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
- Gradient-Based Language Model Red Teaming, Jan 2024, arxiv
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
- Announcing a Benchmark to Improve AI Safety MLCommons has made benchmarks for AI performance—now it's time to measure safety, Apr 2024 IEEE Spectrum
- Model evaluation for extreme risks, May 2023, arxiv
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv
- Evaluating Large Language Models Trained on Code (HumanEval), Jul 2021, arxiv
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, Feb 2021, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming, Feb 2024, arxiv
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
- Gorilla Function Calling Leaderboard, Berkeley, Leaderboard
- DevBench: A Comprehensive Benchmark for Software Development, Mar 2024, arxiv
- MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code, data
- CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv, code at Salesforce github
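
Functional-correctness benchmarks such as HumanEval and MBPP report pass@k, estimated from n sampled completions per problem of which c pass the unit tests. A small sketch of the unbiased estimator described in the HumanEval paper:

```python
# Unbiased pass@k estimator (per problem): probability that at least one of k
# samples drawn from the n generated ones is correct, given c of n passed the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 15 of them pass, estimate pass@10
print(round(pass_at_k(200, 15, 10), 3))
```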
- Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- MLCommons MLPerf benchmarks (inference), MLPerf announcement of the LLM track
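
A rough sketch of the two most basic quantities behind the performance leaderboards above: end-to-end latency and generation throughput for a local model. The model id is illustrative, and production benchmarks (MLPerf, Anyscale) additionally control batching, hardware, serving stack, and input/output length distributions.

```python
# Measuring single-request latency and tokens/second for a local Hugging Face model.
# Illustrative sketch only; model id and generation settings are assumptions.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Explain what a benchmark is in one paragraph."
inputs = tok(prompt, return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tok.eos_token_id)
latency = time.perf_counter() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"latency: {latency:.2f}s, throughput: {new_tokens / latency:.1f} tokens/s")
```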
- Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena, Oct 2023, arxiv
- LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games, Sep 2023, arxiv
- AgentBench: Evaluating LLMs as Agents, Aug 2023, arxiv
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Mar 2024, arxiv
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, Jan 2024, arxiv
- Suri: Multi-constraint Instruction Following for Long-form Text Generation, Jun 2024, arxiv
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Aug 2024, arxiv
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, Aug 2023, arxiv
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, Sep 2024, arxiv
- GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, May 2023, arxiv
- LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs? Oct 2023, arxiv
- Talk like a Graph: Encoding Graphs for Large Language Models, Oct 2023, arxiv
- Open Graph Benchmark: Datasets for Machine Learning on Graphs, NeurIPS 2020
- Can Language Models Solve Graph Problems in Natural Language? NeurIPS 2023
- Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, Aug 2023, https://arxiv.org/abs/2308.11224
(TODO: as there are more than three papers per class, make each class a separate chapter in this compendium)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, June 2024, arxiv
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study, Mar 2024, WSDM 2024, MS blog
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models, Jul 2023, arxiv
- OpenEQA: From word models to world models, Meta, Apr 2024, Understanding physical spaces by Models, Meta AI blog
- Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge. Apr 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
- A User-Centric Benchmark for Evaluating Large Language Models, Apr 2024, arxiv, data of user centric benchmark at github
- RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, arxiv RACE dataset at CMU
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, arxiv CrowS-Pairs dataset
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, Jun 2019, ACL data
- RewardBench: Evaluating Reward Models for Language Modeling, Mar 2024, arxiv
- Toward informal language processing: Knowledge of slang in large language models, EMNLP 2023, amazon science
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, Feb 2024, arxiv
- Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs, May 2023; BIRD, a big benchmark for large-scale database-grounded text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains, arxiv, data and leaderboard
- MuSiQue: Multihop Questions via Single-hop Question Composition, Aug 2021, arxiv
- Evaluating Copyright Takedown Methods for Language Models, June 2024, arxiv
- Google Frames Dataset for evaluation of RAG systems, Sep 2024, arxiv paper: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation, Hugging Face dataset
- RAGAS: Automated Evaluation of Retrieval Augmented Generation, Jul 2023, arxiv
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems, Nov 2023, arxiv
- Evaluating Retrieval Quality in Retrieval-Augmented Generation, Apr 2024, arxiv
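
When evaluating the retrieval side of a RAG system (cf. the retrieval-quality paper above), two simple and widely used metrics are recall@k and mean reciprocal rank. The sketch below is a generic from-scratch illustration, not the RAGAS or ARES implementation.

```python
# Generic retrieval metrics for RAG evaluation: recall@k and MRR over document ids.
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]   # ranked ids returned by the retriever
relevant = {"d2", "d4"}                # gold ids for this query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only d2 is in the top 3)
print(mrr(retrieved, relevant))               # 0.5 (first relevant doc at rank 2)
```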
Conversational systems and dialog systems
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv
- Simulated user feedback for the LLM production, TDS
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis, Feb 2024, arxiv
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv
- A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming, Feb 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- Is ChatGPT a Good Recommender? A Preliminary Study, Apr 2023, arxiv
- LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv
- Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR, featuring 12 domain-specific search tests spanning StackExchange communities and using queries from GooAQ; ColBERT repository with the benchmark data
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLMs
- STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv, code github
- Constitutional AI: Harmlessness from AI Feedback, Sep 2022, arxiv (see Appendix B, Identifying and Classifying Harmful Conversations, and other parts)
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
- Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv See table 2 for evaluation
- Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
- Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
- PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
- Open Medical LLM Leaderboard from HF Explanation
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
- Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
- Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
- Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
- An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
- DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
- A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv
- MATH (Mathematics Aptitude Test of Heuristics): Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021, arxiv
- How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
- CMATH: Can your language model pass Chinese elementary school math test?, Jun 2023, arxiv
- GSM8K, Papers with Code, repository on github
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 2024, arxiv
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
- BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv
- Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv
- LLM/VLM Benchmarks by Aman Chadha
- Awesome LLMs Evaluation Papers, a list of papers mentioned in the Evaluating Large Language Models: A Comprehensive Survey, Nov 2023
@article{Lopatenko2024CompendiumLLMEvaluation,
title = {Compendium of LLM Evaluation methods},
author = {Lopatenko, Andrei},
year = {2024},
note = {\url{https://github.com/alopatenko/LLMEvaluation}}
}