LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.


The LLMEvaluation repository is a comprehensive compendium of evaluation methods for Large Language Models (LLMs) and LLM-based systems. It aims to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs by reviewing industry practices for assessing LLMs and their applications. The repository covers a wide range of evaluation techniques, benchmarks, and studies related to LLMs, including areas such as embeddings, question answering, multi-turn dialogues, reasoning, multi-lingual tasks, ethical AI, biases, safe AI, code generation, summarization, software performance, agent LLM architectures, long text generation, graph understanding, and various unclassified tasks. It also includes evaluations for LLM systems in conversational systems, copilots, search and recommendation engines, task utility, and verticals like healthcare, law, science, financial, and others. The repository provides a wealth of resources for evaluating and understanding the capabilities of LLMs in different domains.

README:

Evaluation of LLMs and LLM-based Systems

Compendium of LLM Evaluation methods


Introduction

The aim of this compendium is to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs. It does so by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it encompasses a comprehensive overview of all effective and practical evaluation techniques, including those embedded within papers that primarily introduce new LLM methodologies and tasks. I plan to periodically update this survey with any noteworthy and shareable evaluation methods that I come across. I aim to create a resource that will enable anyone with queries—whether it's about evaluating a large language model (LLM) or an LLM application for specific tasks, determining the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain—to easily find all the relevant information needed for these tasks. Additionally, I want to highlight various methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align effectively with business or academic objectives.

My view on LLM Evaluation: Deck, SF Big Analytics and AICamp video, Analytics Vidhya (Data Phoenix, Mar 5), by Andrei Lopatenko. Main repository

Evals are surprisingly often all you need

Table of contents


Reviews and Surveys

  • Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
  • A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
  • Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv

Leaderboards and Arenas


Evaluation Software

---

LLM Evaluation articles in tech media and blog posts from companies


Large benchmarks

  • MMLU-Pro: Massive Multitask Language Understanding - Pro version, Jun 2024, arxiv

  • Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks EMNLP 2022, pdf

  • Measuring Massive Multitask Language Understanding (MMLU), ICLR 2021, arxiv, MMLU dataset (a multiple-choice accuracy sketch follows this list)

  • BigBench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arxiv, datasets

  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Oct 2022, arxiv
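
Several of the benchmarks above, MMLU in particular, are scored as plain multiple-choice accuracy. Below is a minimal sketch of that loop; generate_answer is a hypothetical stand-in for your own model call, and the four-option A-D format is an assumption that matches MMLU rather than a universal rule.

import re

# Hypothetical model call: wire this to whatever inference API you use.
def generate_answer(prompt: str) -> str:
    raise NotImplementedError

def format_mcq(question: str, choices: list[str]) -> str:
    """Render a question with lettered options, as MMLU-style harnesses typically do."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer with a single letter (A, B, C or D).")
    return "\n".join(lines)

def extract_choice(text: str) -> str | None:
    """Take the first standalone A-D letter in the model output."""
    match = re.search(r"\b([ABCD])\b", text.strip())
    return match.group(1) if match else None

def mcq_accuracy(examples: list[dict]) -> float:
    """examples: [{'question': str, 'choices': [str, str, str, str], 'answer': int index}, ...]"""
    correct = 0
    for ex in examples:
        pred = extract_choice(generate_answer(format_mcq(ex["question"], ex["choices"])))
        correct += int(pred == "ABCD"[ex["answer"]])
    return correct / len(examples)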


Evaluation of evaluation, Evaluation theory, evaluation methods, analysis of evaluation

  • Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks, May 2024, ICML 2024, arxiv
  • Lessons from the Trenches on Reproducible Evaluation of Language Models, May 2024, arxiv
  • Synthetic data in evaluation, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, arxiv
  • Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023, arxiv (a minimal Elo update sketch follows this list)
  • When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, arxiv
  • Are Emergent Abilities of Large Language Models a Mirage?, Apr 2023, arxiv
  • Don't Make Your LLM an Evaluation Benchmark Cheater, Nov 2023, arxiv
  • Evaluating Question Answering Evaluation, 2019, ACL
  • Evaluating Open-QA Evaluation, 2023, arxiv
  • (Re: statistical methods) Prediction-Powered Inference, Jan 2023, arxiv; PPI++: Efficient Prediction-Powered Inference, Nov 2023, arxiv
  • Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress, Feb 2024, arxiv
  • Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
  • What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
  • Detecting Pretraining Data from Large Language Models, Oct 2023, arxiv
  • Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
  • Faithful model evaluation for model-based metrics, EMNLP 2023, amazon science
  • Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, ICML 2023, mlr press
  • AI leaderboards are no longer useful. It's time to switch to Pareto curves, AI Snake Oil, June 2024
  • State of What Art? A Call for Multi-Prompt LLM Evaluation, Aug 2024, Transactions of the Association for Computational Linguistics, vol. 12
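
Several entries in this section (Elo Uncovered, the leaderboard-sensitivity papers) analyze the Elo-style pairwise ratings used by arena leaderboards. The sketch below shows the standard online Elo update; the K-factor and starting rating are illustrative defaults, not values prescribed by any particular leaderboard.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Example: start both models at 1000 and replay a small battle log (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner, loser in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

Because the update is order-dependent, robustness analyses of this kind often re-run it over many permutations of the battle log or fit a Bradley-Terry model instead.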

Long Comprehensive Studies

  • TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
  • Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv

HITL (Human in the Loop)

  • Evaluating Question Answering Evaluation, 2019, ACL
  • Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop, Feb 2024, arxiv
  • Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation, Nov 2023, arxiv

LLM as Judge

  • Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, Apr 2024, arxiv
  • Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries, Sep 2024, arxiv
  • Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Jun 2023, arxiv
  • Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Jun 2024, arxiv
  • Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv leaderboard code
  • Discovering Language Model Behaviors with Model-Written Evaluations, Dec 2022, arxiv
  • The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate, Feb 2024, arxiv
  • Benchmarking Foundation Models with Language-Model-as-an-Examiner, 2022, NEURIPS
  • Red Teaming Language Models with Language Models, Feb 2022, arxiv
  • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, Aug 2023, arxiv
  • ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning, Sep 2023, arxiv
  • Style Over Substance: Evaluation Biases for Large Language Models, Jul 2023, arxiv
  • Large Language Models Are State-of-the-Art Evaluators of Translation Quality, Feb 2023, arxiv
  • Large Language Models Are State-of-the-Art Evaluators of Code Generation, Apr 2023, researchgate
  • Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators, Mar 2024, arxiv
  • LLM Evaluators Recognize and Favor Their Own Generations, Apr 2024, pdf
  • Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences, Apr 2024, arxiv
  • Using LLMs for Evaluation: LLM-as-a-Judge and other scalable additions to human quality ratings, Aug 2024, Deep Learning Focus
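
A recurring pattern in the papers above (MT-Bench, AlpacaEval-style evaluators, panels of judges) is pairwise comparison with a position swap to counter the position bias documented in several of these studies. The sketch below is a minimal illustration of that pattern, not any specific paper's protocol; call_judge is a hypothetical placeholder for whatever judge model you use.

JUDGE_TEMPLATE = """You are an impartial judge. Compare the two responses to the user question
and answer with exactly one token: "A", "B", or "TIE".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

# Hypothetical judge call: wire this to whatever LLM API you use.
def call_judge(prompt: str) -> str:
    raise NotImplementedError

def pairwise_verdict(question: str, answer_1: str, answer_2: str) -> str:
    """Judge both orderings; only keep a win if it survives the position swap."""
    v1 = call_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_1, answer_b=answer_2))
    v2 = call_judge(JUDGE_TEMPLATE.format(question=question, answer_a=answer_2, answer_b=answer_1))
    if v1 == "A" and v2 == "B":
        return "model_1"
    if v1 == "B" and v2 == "A":
        return "model_2"
    return "tie"  # disagreement between the two orderings is treated as a tie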


LLM Evaluation

Embeddings


In Context Learning


Hallucinations

  • INVITE: A testbed of automatically generated invalid questions to evaluate large language models for hallucinations, EMNLP 2023, amazon science
  • HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, Dec 2023, ACL
  • Long-form factuality in large language models, Mar 2024, arxiv
  • Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Sep 2023, arxiv
  • Measuring Faithfulness in Chain-of-Thought Reasoning, Jul 2023, arxiv
  • FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, May 2023, arxiv repository

Question answering

QA is used in many vertical domains; see the Verticals section below. A minimal exact-match/F1 scoring sketch follows this list.

  • CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, Jun 2019, ACL
  • Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Sep 2018, arxiv OpenBookQA dataset at AllenAI
  • What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (Jin, Di, et al.), 2020, arxiv, MedQA
  • Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, arxiv ARC Easy dataset ARC dataset
  • BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, arxiv BoolQ dataset
  • HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
  • PIQA: Reasoning about Physical Commonsense in Natural Language, Nov 2019, arxiv PIQA dataset
  • Crowdsourcing Multiple Choice Science Questions arxiv SciQ dataset
  • WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019, arxiv, Winogrande dataset
  • TruthfulQA: Measuring How Models Mimic Human Falsehoods, Sep 2021, arxiv
  • TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, arxiv data
  • Natural Questions: A Benchmark for Question Answering Research, Transactions ACL 2019
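
For the short-answer datasets above, the usual automatic metrics are normalized exact match and token-level F1, in the SQuAD style. A minimal sketch, assuming whitespace tokenization of English answers:

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)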

Multi Turn

  • LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models Nov 2023, arxiv
  • MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues Feb 24 arxiv
  • How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
  • MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, Sep 2023, arxiv

Reasoning

  • Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks 2023, arxiv
  • LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, arxiv
  • Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
  • Competition-Level Problems are Effective LLM Evaluators, Dec 23, arxiv
  • Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
  • MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Oct 2023, arxiv

Multi-Lingual

  • AlGhafa Evaluation Benchmark for Arabic Language Models, Dec 2023, ACL Anthology, pdf
  • Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem HF blog
  • Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese , Apr 2024 arxiv
  • AlignBench: Benchmarking Chinese Alignment of Large Language Models, Nov 2023, arxiv
  • The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian, Mar 2024, arxiv
  • MEGA: Multilingual Evaluation of Generative AI, Mar 2023, arxiv
  • M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, 2023, NIPS website
  • LAraBench: Benchmarking Arabic AI with Large Language Models, May 23, arxiv
  • Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?, Apr 2024, arxiv

Multi-Lingual Embedding tasks

  • The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf

  • Chinese MTEB C-MTEB repository

  • French MTEB repository

  • C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, May 2023, arxiv


Multi-Modal

  • Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models, Reka AI, May 2024 arxiv dataset blog post

  • Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis, Aug 2024, arxiv

  • CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models, Jun 2024, arxiv

  • EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models, Jun 2024, arxiv

  • MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models, Jun 2024, arxiv

  • Holistic Evaluation of Text-to-Image Models Nov 23 arxiv

  • VBench: Comprehensive Benchmark Suite for Video Generative Models Nov 23 arxiv

  • Evaluating Text-to-Visual Generation with Image-to-Text Generation, Apr 2024, arxiv

  • What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv

  • Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv

  • MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning, Nov 2023, arxiv

  • BLINK: Multimodal Large Language Models Can See but Not Perceive, Apr 2024, arxiv github

  • Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv

  • Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv

  • VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models, Apr 2024, arxiv

  • Evaluation part of https://arxiv.org/abs/2404.18930, Apr 2024, arxiv, repository


Instruction Following

  • Evaluating Large Language Models at Evaluating Instruction Following Oct 2023, arxiv

  • Instruction-Following Evaluation for Large Language Models (IFEval), Nov 2023, arxiv (a sketch of verifiable-constraint checks follows this list)

  • FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv , FLASK dataset

  • DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, aaai, pdf

  • LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv dataset
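
IFEval-style evaluation (referenced above) relies on instructions whose satisfaction can be checked programmatically rather than judged by a model. The checkers below are a simplified illustration of that idea, not the benchmark's own implementation:

import json

def check_max_words(response: str, limit: int) -> bool:
    """'Answer in at most N words.'"""
    return len(response.split()) <= limit

def check_keyword(response: str, keyword: str) -> bool:
    """'Include the word X in your answer.'"""
    return keyword.lower() in response.lower()

def check_json(response: str) -> bool:
    """'Reply with valid JSON only.'"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def instruction_score(response: str, checks: list) -> float:
    """Fraction of verifiable constraints satisfied (loose score); strict scoring requires all of them."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)

# Example: one prompt with two constraints.
checks = [lambda r: check_max_words(r, 50), lambda r: check_keyword(r, "evaluation")]
score = instruction_score("A short answer about evaluation.", checks)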


Ethical AI

  • Evaluating the Moral Beliefs Encoded in LLMs, Jul 23 arxiv

  • AI Deception: A Survey of Examples, Risks, and Potential Solutions Aug 23 arxiv

  • Aligning AI With Shared Human Value, Aug 20 - Feb 23, arxiv Re: ETHICS benchmark

  • What are human values, and how do we align AI to them?, Mar 2024, pdf

  • TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv

  • Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; now included in BIG-bench: bigbench

  • WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, April 2024, arxiv

  • Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google

  • BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv


Biases

  • FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024 arxiv

  • BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset

  • “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAcct 2023, amazon science

  • This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv


Safe AI

  • Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
  • Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
  • Gradient-Based Language Model Red Teaming, Jan 24, arxiv
  • JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
  • Announcing a Benchmark to Improve AI Safety: MLCommons has made benchmarks for AI performance—now it's time to measure safety, Apr 2024, IEEE Spectrum
  • Model evaluation for extreme risks, May 2023, arxiv
  • Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv

Code Generating LLMs

  • Evaluating Large Language Models Trained on Code (HumanEval), Jul 2021, arxiv (a pass@k sketch follows this list)
  • CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation Feb 21 arxiv
  • Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
  • Gorilla Functional Calling Leaderboard, Berkeley Leaderboard
  • DevBench: A Comprehensive Benchmark for Software Development, Mar 2024, arxiv
  • MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code, data
  • CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
  • CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
  • CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv code at salesforce github
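
The HumanEval entry above reports pass@k, estimated without bias from n sampled completions per problem of which c pass the unit tests. A small sketch of that estimator (the formula follows the HumanEval paper; the sampling and test harness around it are up to you):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples passes,
    estimated from n generated samples of which c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 passing; report pass@1 and pass@10.
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))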

Summarization

  • Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
  • WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
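
ROUGE-style n-gram overlap remains the most common reference-based baseline alongside the LLM-based evaluation discussed above. A minimal sketch, assuming the rouge-score package (an assumption about tooling, not something this compendium prescribes):

# Assumes the open-source `rouge-score` package (pip install rouge-score);
# swap in whichever scorer your pipeline already uses.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The committee approved the budget after a short debate."
candidate = "After a brief debate, the committee approved the budget."

scores = scorer.score(reference, candidate)  # maps metric name -> Score(precision, recall, fmeasure)
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.3f}")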

LLM quality (generic methods: overfitting, redundant layers, etc.)


Software Performance


Agent LLM Architectures

  • Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena, Oct 2023, arxiv
  • LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games, Sep 2023, arxiv
  • AgentBench: Evaluating LLMs as Agents, Aug 2023, arxiv
  • How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Mar 2024, arxiv
  • R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, Jan 2024, arxiv

Long Text Generation

  • Suri: Multi-constraint Instruction Following for Long-form Text Generation, Jun 2024, arxiv
  • LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Aug 2024, arxiv
  • LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, Aug 2023, arxiv
  • HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, Sep 2024, arxiv

Graph understanding

  • GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, May 2023, arxiv
  • LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs? Oct 2023, arxiv
  • Talk like a Graph: Encoding Graphs for Large Language Models, Oct 2023, arxiv
  • Open Graph Benchmark: Datasets for Machine Learning on Graphs, NeurIPS 2020
  • Can Language Models Solve Graph Problems in Natural Language? NeurIPS 2023
  • Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, Aug 2023, arxiv: https://arxiv.org/abs/2308.11224

Various unclassified tasks

(TODO: once a class here accumulates more than three papers, promote it to a separate chapter of this Compendium)

  • DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, June 2024, arxiv
  • Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study Mar 24, WSDM 24, ms blog
  • LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models, Jul 2023, arxiv
  • OpenEQA: From word models to world models (understanding physical spaces), Meta, Apr 2024, Meta AI blog
  • Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge. Apr 2024, arxiv
  • ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
  • LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
  • Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
  • A User-Centric Benchmark for Evaluating Large Language Models, Apr 2024, arxiv, data of user centric benchmark at github
  • RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, arxiv RACE dataset at CMU
  • CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, arxiv CrowS-Pairs dataset
  • DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, Jun 2019, ACL data
  • RewardBench: Evaluating Reward Models for Language Modeling, Mar 2024, arxiv
  • Toward informal language processing: Knowledge of slang in large language models, EMNLP 2023, amazon science
  • FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, Feb 2024, arxiv
  • Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs (BIRD), May 2023, arxiv, data and leaderboard. BIRD contains 12,751 text-to-SQL pairs and 95 databases (33.4 GB total) spanning 37 professional domains.
  • MuSiQue: Multihop Questions via Single-hop Question Composition, Aug 2021, arxiv
  • Evaluating Copyright Takedown Methods for Language Models, June 2024, arxiv


LLM Systems

RAG Evaluation


Conversational systems

And Dialog systems

  • Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature

  • CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv

  • Simulated user feedback for the LLM production, TDS

  • How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv

  • Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv

  • A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv


Copilots

  • Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Feb 24 arxiv
  • ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv

Search and Recommendation Engines

  • Is ChatGPT a Good Recommender? A Preliminary Study, Apr 2023, arxiv

  • LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv

  • BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv

  • Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv

  • BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv

  • Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR, featuring 12 domain-specific search tests spanning StackExchange communities and using queries from GooAQ; ColBERT repository with the benchmark data

  • LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed

  • Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,

  • STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv code github

  • Constitutional AI: Harmlessness from AI Feedback, Sep 2022 arxiv (See Appendix B Identifying and Classifying Harmful Conversations, other parts)
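
Most retrieval benchmarks in this section (BEIR, LoTTE, LongEmbed, LoCo) report cutoff metrics such as Recall@k and nDCG@k. A dependency-free sketch with binary relevance labels follows; several of these benchmarks use graded judgments, so binary relevance is a simplification here:

import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

# Example: ranking of five documents, two of which are relevant.
print(ndcg_at_k(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"}, k=5))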


Task Utility

  • Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv


Verticals

Healthcare and medicine

  • Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
  • Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
  • Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv (see Table 2 for evaluation)
  • Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
  • Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
  • MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
  • What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
  • PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
  • Open Medical LLM Leaderboard from HF Explanation
  • Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
  • Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
  • Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
  • Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
  • Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
  • MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
  • An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
  • DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
  • A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
  • MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv

Law

  • LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
  • LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
  • Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022

Science

  • SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv

  • GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv

  • MATH (Mathematics Aptitude Test of Heuristics): Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021, arxiv

Math

  • How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
  • CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?, Jun 2023, arxiv
  • GSM8K: Training Verifiers to Solve Math Word Problems, 2021, Papers with Code, GitHub repository (an answer-extraction sketch follows this list)
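
GSM8K reference solutions end with a '#### <number>' line, and most open harnesses score a model by comparing that number against the last number the model produces. A minimal sketch of that extraction step (the regex and normalization are my assumptions, not the official evaluation script):

import re

NUM_RE = r"[-+]?[\d,]*\.?\d+"

def extract_final_number(text: str) -> str | None:
    """Prefer the GSM8K '#### <answer>' marker; otherwise take the last number in the text."""
    marked = re.search(r"####\s*(" + NUM_RE + r")", text)
    if marked:
        raw = marked.group(1)
    else:
        numbers = re.findall(NUM_RE, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "")

def gsm8k_correct(model_output: str, reference_answer: str) -> bool:
    pred = extract_final_number(model_output)
    gold = extract_final_number(reference_answer)
    return pred is not None and pred == gold

# Example: the reference solution ends with the '#### 72' marker.
gold = "Natalia sold 48/2 = 24 clips in May, so 48 + 24 = 72 in total.\n#### 72"
assert gsm8k_correct("... therefore the answer is 72.", gold)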

Financial

  • Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 24, arxiv
  • PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
  • BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
  • FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv

Other

  • Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv

Other Collections

Citation

@article{Lopatenko2024CompendiumLLMEvaluation,
  title   = {Compendium of LLM Evaluation methods},
  author  = {Lopatenko, Andrei},
  year    = {2024},
  note    = {\url{https://github.com/alopatenko/LLMEvaluation}}
}
