LLM-Agent-Evaluation-Survey

Top papers related to LLM-based agent evaluation


LLM-Agent-Evaluation-Survey is a curated reading list of benchmarks, evaluation frameworks, and methodology papers for assessing LLM-based agents. Compiled as a companion to the survey "Survey on Evaluation of LLM-based Agents", it organizes the literature by agent capability (planning, tool use, self-reflection, memory), by application domain (web, software engineering, scientific, and conversational agents), and by generalist evaluation and evaluation frameworks, so researchers and practitioners can quickly locate and compare relevant evaluation resources.

README:

Evaluation of LLM-based Agents: A Reading List

Based on the Survey Paper:
Survey on Evaluation of LLM-based Agents (arXiv 2025)
Asaf Yehudai¹², Lilach Eden², Alan Li³, Guy Uziel², Yilun Zhao³, Roy Bar-Haim², Arman Cohan³, Michal Shmueli-Scheuer²
¹The Hebrew University of Jerusalem | ²IBM Research | ³Yale University

Paper

About This Repository

This repository serves as a companion to the survey paper "Survey on Evaluation of LLM-based Agents". It organizes evaluation methodologies, benchmarks, and frameworks according to the structure presented in the paper, aiming to provide a comprehensive resource for researchers and practitioners in the field of LLM-based agents.

The selection criteria focus on works discussed within the survey, covering:

  • Fundamental Agent Capabilities: Planning, Tool Use, Self-Reflection, Memory.
  • Application-Specific Domains: Web, Software Engineering, Scientific, Conversational Agents.
  • Generalist Agent Evaluation.
  • Evaluation Frameworks.

Our goal is to map the rapidly evolving landscape of agent evaluation, highlight key trends, and identify current limitations as discussed in the survey.

Table of Contents

  • Surveys
  • Agent Capabilities Evaluation (§2)
  • Application-Specific Agent Evaluation (§3)
  • Generalist Agents Evaluation (§4)
  • Frameworks for Agent Evaluation (§5)
  • Discussion (§6)
  • Adding a Benchmark / Paper
  • Other Relevant Repositories
  • Citation

🎁 Surveys

  • A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?, arXiv 2025 [paper]
  • Survey on Evaluation of LLM-based Agents, arXiv 2025 [paper] (This work)
  • The Rise and Potential of Large Language Model Based Agents: A Survey, arXiv 2023 [paper]
  • A Survey on Large Language Model based Autonomous Agents, arXiv 2023 [paper]
  • Understanding the planning of LLM agents: A survey, arXiv 2024 [paper]
  • A Survey on the Memory Mechanism of Large Language Model based Agents, arXiv 2024 [paper]

🔧 Agent Capabilities Evaluation (§2)

🗺️ Planning and Multi-Step Reasoning (§2.1)

  • AQUA-RAT: Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, ACL 2017 [paper]
  • HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, EMNLP 2018 [paper]
  • Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv 2018 [paper]
  • MultiRC: Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences, NAACL 2018 [paper]
  • StrategyQA: Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies, TACL 2021 [paper]
  • Measuring Mathematical Problem Solving With the MATH Dataset, NeurIPS 2021 [paper]
  • GSM8K: Training Verifiers to Solve Math Word Problems, arXiv 2021 [paper]
  • FOLIO: Natural Language Reasoning with First-Order Logic, EMNLP 2022 Findings [paper]
  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, arXiv 2022 [paper]
  • Game of 24: Tree of Thoughts: Deliberate Problem Solving with Large Language Models, NeurIPS 2023 [paper]
  • MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback, arXiv 2023 [paper]
  • PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change, NeurIPS 2023 [paper]
  • MUSR: Testing the Limits of Chain-of-Thought with Multistep Soft Reasoning, arXiv 2023 [paper]
  • ToolEmu: Identifying the Risks of LM Agents with an LM-Emulated Sandbox, arXiv 2023 [paper]
  • AutoPlanBench: Automating the Generation of Prompts for LLM-based Action Choice in PDDL Planning, arXiv 2023 [paper]
  • FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents, arXiv 2024 [paper]
  • P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains, EMNLP 2024 Findings [paper]
  • ACPBench: Reasoning about Action, Change, and Planning, arXiv 2024 [paper]
  • Natural Plan: Benchmarking LLMs on Natural Language Planning, arXiv 2024 [paper]
  • LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models, ACL 2024 [paper]

📞 Function Calling & Tool Use (§2.2)

  • APIBench: Revisiting, Benchmarking and Exploring API Recommendation: How Far Are We?, arXiv 2021 [paper]
  • ToolBench: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, arXiv 2023 [paper]
  • ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, arXiv 2023 [paper]
  • API-Bank: A Benchmark for Tool-Augmented LLMs, EMNLP 2023 [paper]
  • NexusRaven-V2: Surpassing GPT-4 for Zero-Shot Function Calling, Blog Post 2023 [link]
  • RestGPT: Connecting Large Language Models with Real-World RESTful APIs, arXiv 2023 [paper]
  • BFCL: Berkeley Function Calling Leaderboard, Blog Post 2024 [link]
  • Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark, arXiv 2024 [paper]
  • ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, arXiv 2024 [paper]
  • APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, NeurIPS 2024 [paper]
  • StableToolBench: Towards Stable Large-scale Benchmarking on Tool Learning of Large Language Models, arXiv 2024 [paper]
  • NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls, arXiv 2024 [paper]
  • API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs, arXiv 2024 [paper]
  • ComplexFuncBench: Exploring Multi-step and Constrained Function Calling under Long-Context Scenario, arXiv 2025 [paper]

🤔 Self-Reflection (§2.3)

  • MiniWoB++: Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration, ICLR 2018 [paper]
  • ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, ICLR 2021 [paper]
  • MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering, arXiv 2022 [paper]
  • LLF-Bench: Benchmark for Interactive Learning from Language Feedback, arXiv 2023 [paper]
  • Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023 [paper]
  • AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, arXiv 2023 [paper]
  • Large Language Models Cannot Self-Correct Reasoning Yet, arXiv 2023 (Huang et al.) [paper]
  • LLM-Evolve: Evaluation for LLM's Evolving Capability on Benchmarks, EMNLP 2024 [paper]
  • Reflection-Bench: Probing AI Intelligence with Reflection, arXiv 2024 [paper]
  • Self-reflection makes Large Language Models safer, less biased, and ideologically neutral, arXiv 2024 (Liu et al.) [paper]

💾 Memory (§2.4)

  • NarrativeQA: The NarrativeQA Reading Comprehension Challenge, TACL 2018 [paper]
  • QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization, NAACL 2021 [paper]
  • QuALITY: Question Answering with Long Input Texts, Yes!, NAACL 2022 Findings [paper]
  • MemGPT: Towards LLMs as Operating Systems, arXiv 2023 [paper]
  • Reflexion: Language Agents with Verbal Reinforcement Learning, NeurIPS 2023 [paper]
  • RAISE: From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models, arXiv 2024 [paper]
  • ReadAgent: A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts, arXiv 2024 [paper]
  • LoCoMo: Evaluating Very Long-Term Conversational Memory of LLM Agents, arXiv 2024 [paper]
  • StreamBench: Towards Benchmarking Continuous Improvement of Language Agents, arXiv 2024 [paper]
  • LTM-Benchmark: Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models, arXiv 2024 [paper]
  • KARMA: Augmenting Embodied AI Agents with Long-and-short Term Memory Systems, arXiv 2024 [paper]
  • A-Mem: Agentic Memory for LLM Agents, arXiv 2025 [paper]

🎯 Application-Specific Agent Evaluation (§3)

🌐 Web Agents (§3.1)

  • MiniWoB: World of Bits: An Open-Domain Platform for Web-Based Agents, ICML 2017 [paper]
  • MiniWoB++: Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration, ICLR 2018 [paper]
  • WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents, NeurIPS 2022 [paper]
  • Mind2Web: Towards a Generalist Agent for the Web, NeurIPS 2023 [paper]
  • WebArena: A Realistic Web Environment for Building Autonomous Agents, ICLR 2024 (arXiv 2023) [paper]
  • WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models, arXiv 2024 [paper]
  • VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks, arXiv 2024 [paper]
  • WebLinX: Real-World Website Navigation with Multi-Turn Dialogue, arXiv 2024 [paper]
  • WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?, arXiv 2024 [paper]
  • MMInA: Benchmarking Multihop Multimodal Internet Agents, arXiv 2024 [paper]
  • WebCanvas: Benchmarking Web Agents in Online Environments, arXiv 2024 [paper]
  • WorkArena++: Towards Compositional Planning and Reasoning-Based Common Knowledge Work Tasks, NeurIPS 2024 [paper]
  • AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?, arXiv 2024 [paper]
  • ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents, arXiv 2024 [paper]
  • VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks, arXiv 2025 [paper]
  • BEARCUBS: A benchmark for computer-using web agents, arXiv 2025 [paper]
  • Online-Mind2Web: An Illusion of Progress? Assessing the Current State of Web Agents, arXiv 2025 [paper]
  • BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents, arXiv 2025 [paper]
  • Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge, arXiv 2025 [paper]

💻 Software Engineering Agents (§3.2)

  • HumanEval: Evaluating Large Language Models Trained on Code, arXiv 2021 [paper]
  • SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, ICLR 2024 (arXiv 2023) [paper]
  • AgentBench: Evaluating LLMs as Agents, ICLR 2024 (arXiv 2023) [paper]
  • SWE-bench Verified: OpenAI Blog Post 2024 [link]
  • SWE-bench Lite: SWE-bench website (c. 2024) [link]
  • SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents, NeurIPS 2024 [paper]
  • SWE-bench+: Enhanced Coding Benchmark for LLMs, arXiv 2024 [paper]
  • SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?, arXiv 2024 [paper]
  • TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?, arXiv 2024 [paper]
  • Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale, arXiv 2024 [paper]
  • IT-Bench: Evaluating AI Agents Across Diverse Real-World IT Automation Tasks, arXiv 2025 [paper]
  • Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models, ICLR 2025 [paper]
  • SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?, arXiv 2025 [paper]
  • CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis, arXiv 2025 [paper]

🔬 Scientific Agents (§3.3)

  • QASPER: A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers, NAACL 2021 [paper]
  • MS²: Multi-Document Summarization of Medical Studies, EMNLP 2021 Findings [paper]
  • ScienceQA: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, NeurIPS 2022 [paper]
  • ScienceWorld: Is Your Agent Smarter Than a 5th Grader?, EMNLP 2022 [paper]
  • DiscoveryWorld: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents, arXiv 2024 [paper]
  • LAB-Bench: Measuring Capabilities of Language Models for Biology Research, arXiv 2024 [paper]
  • SciCode: A Research Coding Benchmark Curated by Scientists, NeurIPS 2024 Datasets Track [paper]
  • Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers, ICLR 2025 (arXiv 2024) [paper]
  • SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories, arXiv 2024 [paper]
  • CORE-Bench: Fostering the Credibility of Published Research through a Computational Reproducibility Agent Benchmark, arXiv 2024 [paper]
  • ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, arXiv 2024 [paper]
  • AAAR-1.0: Assessing AI's Potential to Assist Research, arXiv 2024 [paper]
  • MLGym-Bench: A New Framework and Benchmark for Advancing AI Research Agents, arXiv 2025 [paper]
  • PaperBench: Evaluating AI's Ability to Replicate AI Research, arXiv 2025 [paper]

💬 Conversational Agents (§3.4)

  • MultiWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling, EMNLP 2018 [paper]
  • SMCalFlow: Task-Oriented Dialogue as Dataflow Synthesis, TACL 2020 [paper]
  • ABCD: Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems, NAACL 2021 [paper]
  • τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains, arXiv 2024 [paper]
  • ALMITA: Automated test generation to evaluate tool-augmented LLMs as conversational AI agents, GenBench @ EMNLP 2024 [paper]
  • LTM-Benchmark: Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models, arXiv 2024 [paper]
  • IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems, arXiv 2025 [paper]

🌍 Generalist Agents Evaluation (§4)

  • GAIA: A Benchmark for General AI Assistants, arXiv 2023 [paper]
  • AgentBench: Evaluating LLMs as Agents, ICLR 2024 (arXiv 2023) [paper]
  • OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web, ECCV 2024 [paper]
  • OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, NeurIPS 2024 Datasets Track [paper]
  • Galileo's Agent Leaderboard: HuggingFace Space (c. 2024) [link]
  • AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents, ACL 2024 [paper]
  • AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents, arXiv 2024 [paper]
  • CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments, arXiv 2024 [paper]
  • TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks, arXiv 2024 [paper]
  • AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents, NeurIPS 2024 [paper]
  • HAL: Holistic Agent Leaderboard, HAL leaderboard 2025 [paper]

🏗️ Frameworks for Agent Evaluation (§5)

🎮 Gym-like Environments (§5.1)

  • BrowserGym: The BrowserGym Ecosystem for Web Agent Research [paper]
  • MLGym: A New Framework and Benchmark for Advancing AI Research Agents [paper]
  • SWE-Gym: Training Software Engineering Agents and Verifiers with SWE-Gym [paper]
  • PersonaGym: Evaluating Persona Agents and LLMs [paper]
  • RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing [paper]
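
The frameworks above target different domains, but they share a Gymnasium-style reset/step interaction pattern. The sketch below shows that loop in its most generic form; the `env` and `agent` objects and the 5-tuple returned by `step` are assumptions borrowed from the Gymnasium convention, not the actual BrowserGym, MLGym, or SWE-Gym APIs.

```python
# Minimal sketch of the reset/step evaluation loop that gym-like agent
# environments share. `env` and `agent` are placeholders following the
# Gymnasium convention, not any specific framework's API.
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    task_id: str
    success: bool
    steps: int
    actions: list = field(default_factory=list)


def run_episode(env, agent, max_steps: int = 30) -> EpisodeResult:
    """Roll out one episode and record success and trajectory length."""
    obs, info = env.reset()                 # environment supplies the task and first observation
    actions = []
    for step in range(1, max_steps + 1):
        action = agent.act(obs)             # agent picks a textual / API action
        obs, reward, terminated, truncated, info = env.step(action)
        actions.append(action)
        if terminated or truncated:
            return EpisodeResult(info.get("task_id", ""), reward > 0, step, actions)
    return EpisodeResult(info.get("task_id", ""), False, max_steps, actions)
```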

📈 Discussion (§6)

Trends & Directions

(Refer to Section 6 in the paper for detailed discussion)

➡️ Current Trends (§6.1)

Realistic and Challenging Evaluation

The field is moving beyond simplified, static environments towards benchmarks reflecting real-world complexity and increased difficulty.

Live Benchmarks

Static benchmarks quickly become outdated. There is a trend towards adaptive benchmarks that incorporate live data or continuous updates to maintain relevance.

🧭 Emergent Directions (§6.2)

Advancing Granular Evaluation

Moving beyond coarse, end-to-end success metrics to more detailed, step-by-step analysis to diagnose failures.

  • WebCanvas: Benchmarking Web Agents in Online Environments (Measures key node completion) [paper]
  • LangSmith: LangChain Evaluation Framework (Supports trajectory tracing) [link]
  • Galileo Agentic Evaluation: (Introduces action advancement metric) [link]
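
As a concrete illustration of granular, step-level scoring, the sketch below contrasts an end-to-end success flag with a WebCanvas-style "key node" completion rate over a trajectory. The dictionary-based trajectory and key-node representation is a simplification for this example, not WebCanvas's actual data format.

```python
# Step-level ("key node") completion vs. end-to-end success. The trajectory
# and key-node representations are simplified assumptions for this sketch.

def key_node_completion(trajectory_states, key_nodes):
    """Fraction of required intermediate states ("key nodes") the agent reached."""
    if not key_nodes:
        return 0.0
    hit = sum(
        1 for node in key_nodes
        if any(all(state.get(k) == v for k, v in node.items()) for state in trajectory_states)
    )
    return hit / len(key_nodes)


# Example: the run failed end-to-end (checkout never reached), but step-level
# scoring credits 2 of the 3 required milestones.
states = [{"page": "search", "query": "usb-c cable"},
          {"page": "product", "item": "usb-c cable"}]
nodes = [{"page": "search", "query": "usb-c cable"},
         {"page": "product", "item": "usb-c cable"},
         {"page": "checkout"}]
print(f"end-to-end success: False, key-node rate: {key_node_completion(states, nodes):.2f}")  # 0.67
```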

Cost and Efficiency Metrics

Increasing focus on measuring resource consumption (tokens, time, API calls) alongside performance.
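
A minimal sketch of what such reporting might look like: one episode is scored for success and, at the same time, for tokens, model calls, and wall-clock time. The `usage_fn` accounting hook and the field names in `CostReport` are assumptions, not a standard schema.

```python
# Record cost/efficiency alongside accuracy for a single evaluation episode.
# `run_episode_fn` and `usage_fn` are assumed hooks, not a standard interface.
import time
from dataclasses import dataclass


@dataclass
class CostReport:
    success: bool
    prompt_tokens: int
    completion_tokens: int
    llm_calls: int
    wall_seconds: float


def evaluate_with_cost(run_episode_fn, usage_fn) -> CostReport:
    """Run one episode and report success together with resource consumption.

    `run_episode_fn` is any callable returning a bool success flag (e.g. a wrapper
    around the gym-like loop sketched in §5.1); `usage_fn` is an assumed hook
    returning the agent's token / call counters as a dict.
    """
    start = time.perf_counter()
    success = run_episode_fn()
    usage = usage_fn()
    return CostReport(
        success=success,
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        llm_calls=usage.get("calls", 0),
        wall_seconds=time.perf_counter() - start,
    )
```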

Scaling & Automating Evaluation

Developing methods to reduce reliance on manual annotation and enable continuous, large-scale evaluation.

  • Synthetic Data Generation:
  • Agent-as-a-Judge:
    • Agent-as-a-Judge: Evaluate agents with agents [paper]
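
A hedged sketch of the agent-as-a-judge pattern referenced above: a judge model receives the task, the agent's trajectory, and a rubric, and returns a structured verdict. `call_llm` is a placeholder for any prompt-to-text model client; this is not the implementation from the Agent-as-a-Judge paper.

```python
# Sketch of rubric-based judging of an agent trajectory by another model.
# `call_llm` is a placeholder for any prompt -> text client (an assumption).
import json

RUBRIC = """Score the agent trajectory on each criterion from 1 (poor) to 5 (excellent):
1. Task completion: did the final state satisfy the user's goal?
2. Efficiency: were the steps necessary and non-redundant?
3. Safety: did the agent avoid irreversible or policy-violating actions?
Return JSON: {"task_completion": int, "efficiency": int, "safety": int, "rationale": str}"""


def judge_trajectory(call_llm, task, trajectory):
    """Ask a judge model for rubric scores over a textual agent trajectory."""
    prompt = (
        f"{RUBRIC}\n\nTask:\n{task}\n\nTrajectory:\n"
        + "\n".join(f"{i + 1}. {step}" for i, step in enumerate(trajectory))
    )
    raw = call_llm(prompt)
    return json.loads(raw)  # assumes the judge model complied with the JSON format
```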

Safety and Compliance

Growing need for benchmarks that specifically test safety, trustworthiness, robustness against adversarial inputs, and adherence to policies.

  • AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, ICLR 2025 [paper]
  • ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents, ICLR 2025 [paper]
  • Multi-Agent Risks from Advanced AI, arXiv 2025 [paper]
  • Evaluating Cultural and Social Awareness of LLM Web Agents, NAACL 2025 Findings [paper]
  • Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In, NAACL 2025 [paper]

➕ Adding a Benchmark / Paper

We aim to keep this list comprehensive and up-to-date within the scope of LLM-based Agent Evaluation, as covered in our survey. If you know of a relevant benchmark, evaluation framework, or significant paper that fits this focus and is missing from the list, we welcome your suggestions!

Contribution Guidelines:

  1. Relevance: Please ensure the work is directly related to the evaluation methodologies, benchmarks, frameworks, or core capabilities/applications as they pertain to assessing LLM-based agents.
  2. Information: Provide the full paper title, authors, publication venue/year (or arXiv link), and a direct link to the paper (DOI, arXiv, etc.).
  3. Justification (Optional but helpful): Briefly explain why the paper is relevant and where it might fit within the existing structure.

How to Suggest:

  • Preferred Method: Open a GitHub Issue:
    • Click on the "Issues" tab of this repository.
    • Create a "New Issue".
    • Use a descriptive title (e.g., "Suggestion: Add [Benchmark/Paper Name]").
    • Include the information requested above in the issue description.
  • Alternative: Submit a Pull Request:
    • If you are comfortable editing the README.md file directly, you can fork the repository, add the paper following the existing format, and submit a Pull Request. Please ensure your addition is placed in the appropriate section.

We appreciate your contributions to making this a valuable resource for the community!

🔗 Other Relevant Repositories

While this repository mirrors our survey's scope, other excellent repositories cover LLM Agents more broadly or from different angles:

General LLM Agent Papers:

Specific Focus Repositories:

📝 Citation

If you find this survey or repository helpful, please cite the paper:

@misc{yehudai2025survey,
      title={Survey on Evaluation of LLM-based Agents},
      author={Asaf Yehudai and Lilach Eden and Alan Li and Guy Uziel and Yilun Zhao and Roy Bar-Haim and Arman Cohan and Michal Shmueli-Scheuer},
      year={2025},
      eprint={2503.16416},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.16416}
}
