LLMEvaluation
A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.
The LLMEvaluation repository is a comprehensive compendium of evaluation methods for Large Language Models (LLMs) and LLM-based systems. It aims to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs by reviewing industry practices for assessing LLMs and their applications. The repository covers a wide range of evaluation techniques, benchmarks, and studies related to LLMs, including areas such as embeddings, question answering, multi-turn dialogues, reasoning, multi-lingual tasks, ethical AI, biases, safe AI, code generation, summarization, software performance, agent LLM architectures, long text generation, graph understanding, and various unclassified tasks. It also includes evaluations for LLM systems in conversational systems, copilots, search and recommendation engines, task utility, and verticals like healthcare, law, science, financial, and others. The repository provides a wealth of resources for evaluating and understanding the capabilities of LLMs in different domains.
README:
The aim of this compendium is to assist academics and industry professionals in creating effective evaluation suites tailored to their specific needs. It does so by reviewing the top industry practices for assessing large language models (LLMs) and their applications. This work goes beyond merely cataloging benchmarks and evaluation studies; it encompasses a comprehensive overview of all effective and practical evaluation techniques, including those embedded within papers that primarily introduce new LLM methodologies and tasks. I plan to periodically update this survey with any noteworthy and shareable evaluation methods that I come across. I aim to create a resource that will enable anyone with queries—whether it's about evaluating a large language model (LLM) or an LLM application for specific tasks, determining the best methods to assess LLM effectiveness, or understanding how well an LLM performs in a particular domain—to easily find all the relevant information needed for these tasks. Additionally, I want to highlight various methods for evaluating the evaluation tasks themselves, to ensure that these evaluations align effectively with business or academic objectives.
My view on LLM Evaluation: Deck, SF Big Analytics and AICamp video, Analytics Vidhya (Data Phoenix, Mar 5) (by Andrei Lopatenko). Main repository
- Reviews and Surveys
- Leaderboards and Arenas
- Evaluation Software
- LLM Evaluation articles in tech media and blog posts from companies
- Large benchmarks
- Evaluation of evaluation, evaluation theory, evaluation methods, analysis of evaluation
- Long Comprehensive Studies
- HITL (Human in the Loop)
- LLM as Judge
- LLM Evaluation
- Embeddings
- In Context Learning
- Hallucinations
- Question Answering
- Multi Turn
- Reasoning
- Multi-Lingual
- Multi-Modal
- Instruction Following
- Ethical AI
- Biases
- Safe AI
- Code Generating LLMs
- Summarization
- LLM quality (generic methods: overfitting, redundant layers, etc.)
- Software Performance (latency, throughput, memory, storage)
- Agent LLM architectures
- Long Text Generation
- Graph Understanding
- Various unclassified tasks
- LLM Systems
- Other collections
- Citation
- Evaluating Large Language Models: A Comprehensive Survey, Oct 2023, arxiv
- A Survey on Evaluation of Large Language Models, Jul 2023, arxiv
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models, Aug 2023, arxiv
- New Hard Leaderboard by HuggingFace leaderboard description, blog post
- LMSys Arena (explanation)
- Salesforce's Contextual Bench leaderboard, Hugging Face: an overview of how different LLMs perform across a variety of contextual tasks
- OpenLLM Leaderboard
- MTEB
- SWE Bench
- AlpacaEval leaderboard Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv code
- Open Medical LLM Leaderboard from HF Explanation
- Gorilla, Berkeley function calling Leaderboard Explanation
- WildBench WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- Enterprise Scenarios, Patronus
- Vectara Hallucination Leaderboard
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- Hugging Face LLM Performance hugging face leaderboard
- Multi-task Language Understanding on MMLU
- EleutherAI LLM Evaluation Harness
- OpenAI Evals
- ConfidentAI DeepEval
- MTEB
- OpenICL Framework
- RAGAS
- ML Flow Evaluate
- MosaicML Composer
- Toolkit from Mozilla AI for LLM-as-judge evaluation: lm-buddy (eval tool), Prometheus (model)
- TruLens
- Promptfoo
- BigCode Evaluation Harness
- LangFuse
- LLMeBench see LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
- ChainForge
- Ironclad Rivet
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models, arxiv pdf github repository
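
As a concrete example of using one of the frameworks listed above, below is a minimal sketch of running two benchmark tasks with the EleutherAI LM Evaluation Harness. It assumes `pip install lm-eval` (the v0.4.x Python API) and an illustrative Hugging Face model id; the exact function signature and task names may differ between harness versions.

```python
# Minimal sketch: running benchmarks with the EleutherAI LM Evaluation Harness.
# Assumption: lm-eval v0.4.x API; the model id and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any HF causal LM id
    tasks=["hellaswag", "arc_easy"],                     # tasks covered in this compendium
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics such as acc / acc_norm
```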
---
- Evaluate LLMs using Evaluation Harness and Hugging Face TGI/vLLM, Sep 2024, blog
- Let's talk about LLM Evaluation, HuggingFace, article
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
- Catch me if you can! How to beat GPT-4 with a 13B model, LM sys org
- Why it's impossible to review AIs, and why TechCrunch is doing it anyway, TechCrunch, Mar 2024
- A.I. has a measurement problem, NY Times, Apr 2024
- Beyond Accuracy: The Changing Landscape Of AI Evaluation, Forbes, Mar 2024
- Mozilla AI Exploring LLM Evaluation at scale
- Evaluation part of How to Maximize LLM Performance
- Mozilla AI published multiple good articles on evaluation in the Mozilla AI blog
- Andrej Karpathy on evaluation X
- From Meta on evaluation of Llama 3 models github
- DeepMind AI Safety evaluation, Jun 2024, DeepMind blog, Introducing the Frontier Safety Framework
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- Hamel Dev March 2024, Your AI Product Needs Eval. How to construct domain-specific LLM evaluation systems
- MMLU Pro: Massive Multitask Language Understanding - Pro version, Jun 2024, arxiv
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, EMNLP 2022, pdf
- Measuring Massive Multitask Language Understanding, MMLU, ICLR 2021, arxiv, MMLU dataset
- BigBench: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arxiv, datasets
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, Oct 2022, arxiv
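
Benchmarks like MMLU, HellaSwag, and ARC are usually scored not by parsing free-form generations but by comparing the model's likelihood of each answer option. The sketch below illustrates that idea from scratch with a small Hugging Face model; the model id and prompt format are illustrative assumptions, not the official scoring code of any benchmark above.

```python
# Sketch of log-likelihood multiple-choice scoring, the common way MMLU / HellaSwag /
# ARC-style benchmarks are evaluated: pick the option with the highest
# (length-normalized) log-probability given the question. Model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(context: str, choice: str) -> float:
    """Mean log-prob of the choice tokens given the context.
    Assumes tokenizing the context alone gives a prefix of tokenizing context+choice
    (true for typical BPE tokenizers when the choice starts with a space)."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..L-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    choice_lp = token_lp[:, ctx_ids.shape[1] - 1:]          # keep only the choice tokens
    return choice_lp.mean().item()

question = "Q: What is the capital of France?\nA:"
choices = [" Paris", " Berlin", " Madrid", " Rome"]
pred = max(range(len(choices)), key=lambda i: choice_logprob(question, choices[i]))
print("predicted:", choices[pred].strip())
```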
- Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks, May 2024, ICML 2024, arxiv
- Lessons from the Trenches on Reproducible Evaluation of Language Models, May 2024, arxiv
- Synthetic data in evaluation, see Chapter 3 in Best Practices and Lessons Learned on Synthetic Data for Language Models, Apr 2024, arxiv
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation, Nov 2023 arxiv
- When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, Feb 2024, arxiv
- Are Emergent Abilities of Large Language Models a Mirage?, Apr 2023, arxiv
- Don't Make Your LLM an Evaluation Benchmark Cheater, Nov 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Evaluating Open-QA Evaluation, 2023, arxiv
- (re: statistical methods) Prediction-Powered Inference, Jan 2023, arxiv; PPI++: Efficient Prediction-Powered Inference, Nov 2023, arxiv
- Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress, Feb 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Detecting Pretraining Data from Large Language Models, Oct 2023, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- Faithful model evaluation for model-based metrics, EMNLP 2023, amazon science
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models, ICML 2023, mlr press
- AI Snake Oil, June 2024, AI leaderboards are no longer useful. It's time to switch to Pareto curves.
- State of What Art? A Call for Multi-Prompt LLM Evaluation, Aug 2024, Transactions of the Association for Computational Linguistics (2024) 12
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- Evaluating Question Answering Evaluation, 2019, ACL
- Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop, Feb 2024, arxiv
- Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation, Nov 2023, arxiv
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, Apr 2024, arxiv
- Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries, Sep 2024, arxiv
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Jun 2023, arxiv
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, Jun 2024, arxiv
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators, Apr 2024, arxiv leaderboard code
- Discovering Language Model Behaviors with Model-Written Evaluations, Dec 2022, arxiv
- The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate, Feb 2024, arxiv
- Benchmarking Foundation Models with Language-Model-as-an-Examiner, 2022, NEURIPS
- Red Teaming Language Models with Language Models, Feb 2022, arxiv
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, Aug 2023, arxiv
- ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning, Sep 2023, arxiv
- Style Over Substance: Evaluation Biases for Large Language Models, Jul 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Translation Quality, Feb 2023, arxiv
- Large Language Models Are State-of-the-Art Evaluators of Code Generation, Apr 2023, researchgate
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators, Mar 2024, arxiv
- LLM Evaluators Recognize and Favor Their Own Generations, Apr 2024, pdf
- Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences, Apr 2024, arxiv
- Using LLMs for Evaluation LLM-as-a-Judge and other scalable additions to human quality ratings. Aug 2024, Deep Learning Focus
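
A minimal sketch of the pairwise LLM-as-judge pattern discussed in the papers above (MT-Bench, AlpacaEval, judge panels). It uses the OpenAI Python client (v1+, with OPENAI_API_KEY set); the judge model name and prompt wording are illustrative assumptions, not a reference implementation. Judging each pair twice with the answers swapped is one common mitigation for the position bias documented in the papers above.

```python
# Pairwise LLM-as-judge sketch. Assumptions: OpenAI Python client >= 1.0,
# OPENAI_API_KEY in the environment, illustrative judge model and prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two answers,
reply with exactly "A", "B", or "tie" for the better answer.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

answer_x = "Overfitting is when a model memorizes training data instead of generalizing."
answer_y = "Overfitting means the model fits noise, so it performs worse on new data."
# Judge in both orders and only count a win if the two verdicts agree.
print(judge_pair("Explain overfitting.", answer_x, answer_y))
print(judge_pair("Explain overfitting.", answer_y, answer_x))
```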
- MTEB: Massive Text Embedding Benchmark, Oct 2022, [arxiv](https://arxiv.org/abs/2210.07316), Leaderboard
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- MMTEB: Community driven extension to MTEB repository
- Chinese MTEB C-MTEB repository
- French MTEB repository
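
A short sketch of how an embedding model is typically evaluated on MTEB-style benchmarks such as those listed above. It assumes the `mteb` and `sentence-transformers` packages; the task names and model id are illustrative, and the MTEB API may differ slightly between versions.

```python
# Sketch: evaluating an embedding model on a couple of MTEB tasks.
# Assumptions: `pip install mteb sentence-transformers`; task names, model id,
# and the exact MTEB constructor signature may vary with the installed version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```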
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- The LAMBADA dataset: Word prediction requiring a broad discourse context 2016, arxiv
- INVITE: A testbed of automatically generated invalid questions to evaluate large language models for hallucinations, EMNLP 2023, amazon science
- HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, Dec 2023, ACL
- Long-form factuality in large language models, Mar 2024, arxiv
- Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models, Sep 2023, arxiv
- Measuring Faithfulness in Chain-of-Thought Reasoning, Jul 2023, arxiv
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, May 2023, arxiv repository
QA is used in many vertical domains; see the Verticals section below.
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, Jun 2019, ACL
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, Sep 2018, arxiv OpenBookQA dataset at AllenAI
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (Jin, Di, et al.), 2020, arxiv, MedQA
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, arxiv ARC Easy dataset ARC dataset
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions, 2019, arxiv BoolQ dataset
- HellaSwag: Can a Machine Really Finish Your Sentence?, 2019, arxiv, paper + code + dataset: https://rowanzellers.com/hellaswag/
- PIQA: Reasoning about Physical Commonsense in Natural Language, Nov 2019, arxiv PIQA dataset
- Crowdsourcing Multiple Choice Science Questions arxiv SciQ dataset
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019, arxiv, Winogrande dataset
- TruthfulQA: Measuring How Models Mimic Human Falsehoods, Sep 2021, arxiv
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, arxiv data
- Natural Questions: A Benchmark for Question Answering Research, Transactions ACL 2019
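
For extractive and short-answer QA benchmarks like those above, results are usually reported as exact match and token-level F1 over normalized answers. Below is a from-scratch sketch of those two metrics using SQuAD-style normalization; official benchmark scripts may differ in details.

```python
# Exact match and token-level F1 for short-answer QA (SQuAD-style normalization).
# Generic sketch, not the official scoring script of any benchmark above.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1("in Paris, France", "Paris"), 2))        # 0.5
```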
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models Nov 2023, arxiv
- MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues, Feb 2024, arxiv
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis Feb 2024 arxiv
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, Sep 2023, arxiv
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks 2023, arxiv
- LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models, arxiv
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 2024, arxiv
- Competition-Level Problems are Effective LLM Evaluators, Dec 2023, arxiv
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning, Oct 2023, arxiv
- AlGhafa Evaluation Benchmark for Arabic Language Models, Dec 2023, ACL Anthology, pdf article
- Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem HF blog
- Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese , Apr 2024 arxiv
- AlignBench: Benchmarking Chinese Alignment of Large Language Models, Nov 2023, arxiv
- The Invalsi Benchmark: measuring Language Models Mathematical and Language understanding in Italian, Mar 2024, arxiv
- MEGA: Multilingual Evaluation of Generative AI, Mar 2023, arxiv
- M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, 2023, NIPS website
- LAraBench: Benchmarking Arabic AI with Large Language Models, May 2023, arxiv
- Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?, Apr 2024, arxiv
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding, openreview pdf
- Chinese MTEB C-MTEB repository
- French MTEB repository
- C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, May 2023, arxiv
- Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models, Reka AI, May 2024, arxiv, dataset, blog post
- Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis, Aug 2024, arxiv
- CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models, Jun 2024, arxiv
- EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models, Jun 2024, arxiv
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models, Jun 2024, arxiv
- Holistic Evaluation of Text-to-Image Models, Nov 2023, arxiv
- VBench: Comprehensive Benchmark Suite for Video Generative Models, Nov 2023, arxiv
- Evaluating Text-to-Visual Generation with Image-to-Text Generation, Apr 2024, arxiv
- What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases, Apr 2024, arxiv
- Are We on the Right Way for Evaluating Large Vision-Language Models?, Apr 2024, arxiv
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning, Nov 2023, arxiv
- BLINK: Multimodal Large Language Models Can See but Not Perceive, Apr 2024, arxiv, github
- Eyes Can Deceive: Benchmarking Counterfactual Reasoning Capabilities of Multimodal Large Language Models, Apr 2024, arxiv
- Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings, Apr 2024, arxiv
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models, Apr 2024, arxiv
- Evaluation part of https://arxiv.org/abs/2404.18930, Apr 2024, arxiv, repository
- Evaluating Large Language Models at Evaluating Instruction Following, Oct 2023, arxiv
- Instruction-Following Evaluation for Large Language Models, IFEval, Nov 2023, arxiv
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets, Jul 2023, arxiv, FLASK dataset
- DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation, Mar 2024, AAAI, pdf
- LongForm: Effective Instruction Tuning with Reverse Instructions, Apr 2023, arxiv, dataset
- Evaluating the Moral Beliefs Encoded in LLMs, Jul 2023, arxiv
- AI Deception: A Survey of Examples, Risks, and Potential Solutions, Aug 2023, arxiv
- Aligning AI With Shared Human Values, Aug 2020 - Feb 2023, arxiv (re: the ETHICS benchmark)
- What are human values, and how do we align AI to them?, Mar 2024, pdf
- TrustLLM: Trustworthiness in Large Language Models, Jan 2024, arxiv
- Helpfulness, Honesty, Harmlessness (HHH) framework from Anthropic, introduced in A General Language Assistant as a Laboratory for Alignment, 2021, arxiv; now included in BigBench, bigbench
- WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models, Apr 2024, arxiv
- Chapter 19 in The Ethics of Advanced AI Assistants, Apr 2024, Google DeepMind, pdf at google
- BEHONEST: Benchmarking Honesty of Large Language Models, June 2024, arxiv
- FairPair: A Robust Evaluation of Biases in Language Models through Paired Perturbations, Apr 2024, arxiv
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation, 2021, arxiv, dataset
- “I’m fully who I am”: Towards centering transgender and non-binary voices to measure biases in open language generation, ACM FAcct 2023, amazon science
- This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models, May 2023, arxiv
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch, Jun 2023, arxiv
- Rethinking Backdoor Detection Evaluation for Language Models, Aug 2024, arxiv pdf
- Gradient-Based Language Model Red Teaming, Jan 2024, arxiv
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models, Mar 2024, arxiv
- Announcing a Benchmark to Improve AI Safety MLCommons has made benchmarks for AI performance—now it's time to measure safety, Apr 2024 IEEE Spectrum
- Model evaluation for extreme risks, May 2023, arxiv
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Jan 2024, arxiv
- Evaluating Large Language Models Trained on Code (HumanEval), Jul 2021, arxiv
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation, Feb 2021, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming, Feb 2024, arxiv
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Feb 2024, arxiv, Tech Report
- Gorilla Function Calling Leaderboard, Berkeley, Leaderboard
- DevBench: A Comprehensive Benchmark for Software Development, Mar 2024, arxiv
- MBPP (Mostly Basic Python Programming) benchmark, introduced in Program Synthesis with Large Language Models, 2021, papers with code, data
- CodeMind: A Framework to Challenge Large Language Models for Code Reasoning, Feb 2024, arxiv
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution, Jan 2024, arxiv
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, Jul 2022, arxiv, code at Salesforce github
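
Functional-correctness benchmarks such as HumanEval and MBPP report pass@k, estimated from n sampled completions per problem of which c pass the unit tests. A small sketch of the unbiased estimator described in the HumanEval paper:

```python
# Unbiased pass@k estimator (per problem): probability that at least one of k
# samples drawn from the n generated ones is correct, given c of n passed the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 15 of them pass, estimate pass@10
print(round(pass_at_k(200, 15, 10), 3))
```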
- Human-like Summarization Evaluation with ChatGPT, Apr 2023, arxiv
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization, 2021, Transactions ACL dataset
- Ray/Anyscale's LLM Performance Leaderboard (explanation)
- MLCommons MLPerf benchmarks (inference), MLPerf announcement of the LLM track
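
A rough sketch of the two most basic quantities behind the performance leaderboards above: end-to-end latency and generation throughput for a local model. The model id is illustrative, and production benchmarks (MLPerf, Anyscale) additionally control batching, hardware, serving stack, and input/output length distributions.

```python
# Measuring single-request latency and tokens/second for a local Hugging Face model.
# Illustrative sketch only; model id and generation settings are assumptions.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Explain what a benchmark is in one paragraph."
inputs = tok(prompt, return_tensors="pt")

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False,
                        pad_token_id=tok.eos_token_id)
latency = time.perf_counter() - start

new_tokens = output.shape[1] - inputs.input_ids.shape[1]
print(f"latency: {latency:.2f}s, throughput: {new_tokens / latency:.1f} tokens/s")
```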
- Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena, Oct 2023, arxiv
- LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games, Sep 2023, arxiv
- AgentBench: Evaluating LLMs as Agents, Aug 2023, arxiv
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments, Mar 2024, arxiv
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, Jan 2024, arxiv
- Suri: Multi-constraint Instruction Following for Long-form Text Generation, Jun 2024, arxiv
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs, Aug 2024, arxiv
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, Aug 2023, arxiv
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models, Sep 2024, arxiv
- GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking, May 2023, arxiv
- LLM4DyG: Can Large Language Models Solve Spatial-Temporal Problems on Dynamic Graphs? Oct 2023, arxiv
- Talk like a Graph: Encoding Graphs for Large Language Models, Oct 2023, arxiv
- Open Graph Benchmark: Datasets for Machine Learning on Graphs, NeurIPS 2020
- Can Language Models Solve Graph Problems in Natural Language? NeurIPS 2023
- Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis, Aug 2023, https://arxiv.org/abs/2308.11224
(TODO: as there are more than three papers per class, make each class a separate chapter in this compendium)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph, June 2024, arxiv
- Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study, Mar 2024, WSDM 2024, MS blog
- LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models, Jul 2023, arxiv
- OpenEQA: From word models to world models, Meta, Apr 2024, Understanding physical spaces by Models, Meta AI blog
- Is Your LLM Outdated? Benchmarking LLMs & Alignment Algorithms for Time-Sensitive Knowledge. Apr 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLM,
- A User-Centric Benchmark for Evaluating Large Language Models, Apr 2024, arxiv, data of user centric benchmark at github
- RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, arxiv RACE dataset at CMU
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, arxiv CrowS-Pairs dataset
- DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs, Jun 2019, ACL data
- RewardBench: Evaluating Reward Models for Language Modeling, Mar 2024, arxiv
- Toward informal language processing: Knowledge of slang in large language models, EMNLP 2023, amazon science
- FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability, Feb 2024, arxiv
- Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs, May 2023; BIRD, a big benchmark for large-scale database-grounded text-to-SQL tasks, containing 12,751 pairs of text-to-SQL data and 95 databases with a total size of 33.4 GB, spanning 37 professional domains, arxiv, data and leaderboard
- MuSiQue: Multihop Questions via Single-hop Question Composition, Aug 2021, arxiv
- Evaluating Copyright Takedown Methods for Language Models, June 2024, arxiv
- Google Frames Dataset for evaluation of RAG systems, Sep 2024, arxiv paper: Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation, Hugging Face dataset
- RAGAS: Automated Evaluation of Retrieval Augmented Generation, Jul 2023, arxiv
- ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems, Nov 2023, arxiv
- Evaluating Retrieval Quality in Retrieval-Augmented Generation, Apr 2024, arxiv
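
When evaluating the retrieval side of a RAG system (cf. the retrieval-quality paper above), two simple and widely used metrics are recall@k and mean reciprocal rank. The sketch below is a generic from-scratch illustration, not the RAGAS or ARES implementation.

```python
# Generic retrieval metrics for RAG evaluation: recall@k and MRR over document ids.
from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d7", "d2", "d9", "d4"]   # ranked ids returned by the retriever
relevant = {"d2", "d4"}                # gold ids for this query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only d2 is in the top 3)
print(mrr(retrieved, relevant))               # 0.5 (first relevant doc at rank 2)
```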
Conversational systems and dialog systems
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems, Jun 2024, arxiv
- Simulated user feedback for the LLM production, TDS
- How Well Can LLMs Negotiate? NEGOTIATIONARENA Platform and Analysis, Feb 2024, arxiv
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs, Apr 2024, arxiv
- A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4, Jun 2024, arxiv
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming, Feb 2024, arxiv
- ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models, Apr 2024, arxiv
- Is ChatGPT a Good Recommender? A Preliminary Study, Apr 2023, arxiv
- LaMP: When Large Language Models Meet Personalization, Apr 2023, arxiv
- BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives, Feb 2024, arxiv
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, Apr 2023, arxiv
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Oct 2021, arxiv
- Benchmark: LoTTE, Long-Tail Topic-stratified Evaluation for IR, featuring 12 domain-specific search tests spanning StackExchange communities and using queries from GooAQ; ColBERT repository with the benchmark data
- LongEmbed: Extending Embedding Models for Long Context Retrieval, Apr 2024, arxiv, benchmark for long context tasks, repository for LongEmbed
- Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, Feb 2024, arxiv, LoCoV1 benchmark for long context LLMs
- STARK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases, Apr 2024, arxiv, code github
- Constitutional AI: Harmlessness from AI Feedback, Sep 2022, arxiv (see Appendix B, Identifying and Classifying Harmful Conversations, and other parts)
- Towards better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications, Feb 2024, arxiv
- Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI, Feb 2024, Nature
- Evaluating Generative AI Responses to Real-world Drug-Related Questions, June 2024, Psychiatry Research
- Clinical Insights: A Comprehensive Review of Language Models in Medicine, Aug 2024, arxiv See table 2 for evaluation
- Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data Jan 2024 arxiv
- Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis, Jan 2024, arxiv
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022, PMLR
- What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, MedQA benchmark, Sep 2020, arxiv
- PubMedQA: A Dataset for Biomedical Research Question Answering, 2019, acl
- Open Medical LLM Leaderboard from HF Explanation
- Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics, Apr 2023, arxiv
- Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery, Apr 2023, pub med
- Can LLMs like GPT-4 outperform traditional AI tools in dementia diagnosis? Maybe, but not today, Jun 2023, arxiv
- Evaluating the use of large language model in identifying top research questions in gastroenterology, Mar 2023, nature
- Evaluating AI systems under uncertain ground truth: a case study in dermatology, Jul 2023, arxiv
- MedDialog: Two Large-scale Medical Dialogue Datasets, Apr 2020, arxiv
- An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, article html
- DrugBank 5.0: a major update to the DrugBank database for 2018, 2018, paper html
- A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models, May 2024, nature, dataset
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, Aug 2023, arxiv
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, NeurIPS 2023
- LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain, EMNLP 2023
- Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities NeurIPS 2022
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations, 2022, arxiv
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark, Nov 2023, arxiv
- MATH (Mathematics Aptitude Test of Heuristics): Measuring Mathematical Problem Solving With the MATH Dataset, Nov 2021, arxiv
- How well do large language models perform in arithmetic tasks?, Mar 2023, arxiv
- CMATH: Can your language model pass Chinese elementary school math test?, Jun 2023, arxiv
- GSM8K, Papers with Code, repository on github
- Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering, Feb 2024, arxiv
- PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance, Jun 2023, arxiv
- BloombergGPT: A Large Language Model for Finance (see Chapter 5 Evaluation), Mar 2023, arxiv
- FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets, Oct 2023, arxiv
- Understanding the Capabilities of Large Language Models for Automated Planning, May 2023, arxiv
- LLM/VLM Benchmarks by Aman Chadha
- Awesome LLMs Evaluation Papers, a list of papers mentioned in the Evaluating Large Language Models: A Comprehensive Survey, Nov 2023
@article{Lopatenko2024CompendiumLLMEvaluation,
title = {Compendium of LLM Evaluation methods},
author = {Lopatenko, Andrei},
year = {2024},
note = {\url{https://github.com/alopatenko/LLMEvaluation}}
}