web-bench

Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development.

Stars: 205


Web-Bench is a benchmark for evaluating how well LLMs perform on realistic web development work. It contains 50 projects, each made up of 20 sequentially dependent tasks that implement project features step by step, the way a human engineer would. The benchmark covers both Web Standards and Web Frameworks, ships with a reference evaluation agent (Web-Agent), and reports Pass@1 scores, on which the current SOTA model reaches only 25.1%. It is useful for researchers and engineers who want to measure LLM code generation on projects substantially harder than saturated benchmarks such as HumanEval and MBPP.

README:

Web-Bench

Chinese | Install | Paper | Datasets | LeaderBoard | Citation

📖 Overview

Web-Bench is a benchmark designed to evaluate the performance of LLMs in real-world Web development. It contains 50 projects, each consisting of 20 tasks with sequential dependencies: the tasks implement project features in order, simulating how human engineers build software incrementally. Web-Bench is designed to cover the foundational elements of Web development, namely Web Standards and Web Frameworks. The projects were designed by engineers with 5-10 years of experience, and given their scale and complexity, each presents a significant challenge; on average, a single project takes a senior engineer 4-8 hours to complete. With our reference benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1.
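To make the sequential-dependency structure concrete, the sketch below models a project and its tasks in TypeScript. The interfaces and field names are illustrative assumptions, not Web-Bench's actual schema; it reflects one plausible reading in which a failed task blocks every task after it.

// Illustrative sketch only: these names are assumptions, not Web-Bench's schema.
interface Task {
  id: number;          // 1..20 within a project
  description: string; // the feature to implement at this step
}

interface Project {
  name: string;  // e.g. "@web-bench/react"
  tasks: Task[]; // 20 tasks, each building on the code produced by the previous one
}

// Under sequential dependencies, a model is credited only for the unbroken
// prefix of tasks it solves: the first failure blocks everything after it.
function solvedPrefixLength(taskPassed: boolean[]): number {
  let solved = 0;
  for (const passed of taskPassed) {
    if (!passed) break;
    solved++;
  }
  return solved;
}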

The distribution of the experimental data aligns well with the current code generation capabilities of mainstream LLMs.

(Figure: Pass@1 of mainstream LLMs across code benchmarks)

HumanEval and MBPP have approached saturation, and APPS and EvalPlus are approaching it. The SOTA Pass@1 on Web-Bench is 25.1%, lower even than on the SWE-bench Full and Verified sets, which makes Web-Bench the more challenging, and therefore more discriminating, benchmark.
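For reference, Pass@1 is the standard pass@k metric at k = 1. Assuming Web-Bench reports the usual unbiased estimator from the HumanEval paper (an assumption; see the linked paper for the authoritative definition), it can be computed as follows. With one sample per task (n = 1), it reduces to the plain fraction of tasks solved.

// Unbiased pass@k estimator (Chen et al., 2021).
// n = samples generated per task, c = samples that passed, k = sample budget.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw contains a passing sample
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= 1 - k / i; // stable product form of 1 - C(n-c, k) / C(n, k)
  }
  return 1 - prod;
}

// Benchmark-level score: the mean pass@k over all tasks.
function meanPassAtK(perTask: Array<{ n: number; c: number }>, k: number): number {
  return perTask.reduce((sum, t) => sum + passAtK(t.n, t.c, k), 0) / perTask.length;
}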

(Figure: SOTA model scores across benchmarks)

🚀 Quick Start

Refer to the Docker setup guide for instructions on installing Docker on your machine.

  1. Create a new empty folder and add two files to it:
./config.json5
./docker-compose.yml
  2. For config.json5, copy the JSON5 below and edit it according to the Config Parameters (an optional sanity-check sketch for this file follows these steps):
{
  models: [
    'openai/gpt-4o',
    // You can add more models here
    // "claude-sonnet-4-20250514"
  ],
  // Eval one project only
  // "projects": ["@web-bench/react"]
}
  3. For docker-compose.yml, copy the YAML below and set the environment variables:
services:
  web-bench:
    image: maoyiweiebay777/web-bench:latest
    volumes:
      - ./config.json5:/app/apps/eval/src/config.json5
      - ./report:/app/apps/eval/report
    environment:
      # Add environment variables according to apps/src/model.json
      - OPENROUTER_API_KEY=your_api_key
      # Add more models' keys as needed
      # - ANTHROPIC_API_KEY=your_api_key
  4. Run Docker Compose:
docker compose up
  5. The evaluation report will be generated under ./report/.
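Optionally, before running docker compose up, you can sanity-check config.json5 locally. The sketch below is not part of Web-Bench; it assumes Node.js with the json5 package installed (npm install json5) and that both files sit in the current folder as in step 1.

// check-config.ts: hypothetical pre-flight check, not part of Web-Bench.
import { readFileSync } from "node:fs";
import JSON5 from "json5";

const config = JSON5.parse(readFileSync("./config.json5", "utf8"));

if (!Array.isArray(config.models) || config.models.length === 0) {
  throw new Error("config.json5 must declare at least one entry in `models`");
}
for (const model of config.models) {
  console.log(`will evaluate: ${model}`);
}
if (config.projects) {
  console.log(`restricted to projects: ${config.projects.join(", ")}`);
}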

If you wish to evaluate from source code, refer to Install from source.

🛠️ Contribution

📚 Citation

@article{xu2025webbench,
  title={Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks},
  author={Xu, Kai and Mao, YiWei and Guan, XinYi and Feng, ZiLong},
  journal={arXiv preprint arXiv:2505.07473},
  year={2025}
}

📄 License

Apache 2.0

🌟 Contact us

  • Lark: Register on Feishu and scan the QR code below to join our Web-Bench user group.

(QR code for the Web-Bench user group)
