ts-bench

Measure and compare the performance of AI coding agents on TypeScript tasks.

Stars: 141

ts-bench is a benchmark for evaluating how well AI coding agents edit TypeScript code. It runs CLI-based agents such as Aider and Claude Code against a self-contained set of Exercism-derived TypeScript exercises, records success rates and solve times, and publishes the results as a leaderboard. All runs happen in GitHub Actions, so results are transparent and reproducible.

README:

ts-bench: TypeScript Agent Benchmark

ts-bench is a transparent and reproducible benchmark project for evaluating the TypeScript code editing capabilities of AI coding agents.

Leaderboard

| Rank | Agent | Model | Success Rate | Solved | Avg Time | Result |
|------|-------|-------|--------------|--------|----------|--------|
| 1 | opencode | openai/gpt-5 | 96.0% | 24/25 | 64.8s | #415419 |
| 2 | goose | claude-sonnet-4-20250514 | 92.0% | 23/25 | 122.2s | #186071 |
| 3 | opencode | anthropic/claude-sonnet-4-20250514 | 92.0% | 23/25 | 127.8s | #043809 |
| 4 | gemini | gemini-2.5-pro | 92.0% | 23/25 | 168.5s | #052819 |
| 5 | codex | gpt-5 | 88.0% | 22/25 | 91.7s | #734992 |
| 6 | opencode | opencode/grok-code | 88.0% | 22/25 | 97.0s | #083421 |
| 7 | claude | glm-4.5 | 80.0% | 20/25 | 172.3s | #591219 |
| 8 | claude | claude-sonnet-4-20250514 | 72.0% | 18/25 | 206.1s | #732069 |
| 9 | qwen | qwen3-coder-plus | 64.0% | 16/25 | 123.9s | #246268 |
| 10 | aider | claude-sonnet-4-20250514 | 32.0% | 8/25 | 40.5s | #119174 |

🤖 Supported Agents

Currently supported agents (as listed in the leaderboard): aider, claude, codex, gemini, goose, opencode, and qwen.

📖 Vision & Principles

This project is strongly inspired by benchmarks like Aider Polyglot. Rather than measuring the performance of large language models (LLMs) alone, it focuses on evaluating the agent layer: the entire AI coding assistant tool, including prompt strategies, file operations, and iterative logic.

Based on this vision, the benchmark is designed according to the following principles:

  • TypeScript-First: Focused on TypeScript, which is essential in modern development. Static typing presents unique challenges and opportunities for AI agents, making it a crucial evaluation target.
  • Agent-Agnostic: Designed to be independent of any specific AI agent, allowing fair comparison of multiple CLI-based agents such as Aider and Claude Code (see the sketch after this list).
  • Baseline Performance: Uses self-contained problem sets sourced from Exercism to serve as a baseline for measuring basic code reading and editing abilities. It is not intended to measure performance on large-scale editing tasks or complex bug fixes across entire repositories like SWE-bench.
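
The agent-agnostic idea can be illustrated with a small adapter layer. The sketch below is hypothetical, not ts-bench's actual code or API: the AgentAdapter interface, the runExercise helper, and the command-building approach are all illustrative assumptions about how a runner might invoke different CLI agents uniformly.

```typescript
// Hypothetical sketch: not ts-bench's real implementation.
import { spawn } from "node:child_process";

interface AgentAdapter {
  /** Name used to select the agent, e.g. "claude" or "aider" (illustrative). */
  name: string;
  /** Build the argv that asks this CLI agent to solve one exercise (illustrative). */
  command(model: string, exerciseDir: string): string[];
}

/** Run one exercise with any adapter and resolve with the process exit code. */
async function runExercise(agent: AgentAdapter, model: string, exerciseDir: string): Promise<number> {
  const [cmd, ...args] = agent.command(model, exerciseDir);
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { cwd: exerciseDir, stdio: "inherit" });
    child.on("close", (code) => resolve(code ?? 1));
  });
}
```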

📊 Results & Methodology

All benchmark results are generated and published via GitHub Actions.

Each results page provides a formatted summary and downloadable artifacts containing raw data (JSON).
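
The raw JSON artifact can be post-processed locally. The snippet below is a minimal sketch under assumed names: the file name results.json and the fields exercise, passed, and durationMs are assumptions for illustration, not the documented artifact schema.

```typescript
// Minimal sketch: file name and field names are assumptions, not the documented schema.
interface ExerciseResult {
  exercise: string;   // assumed: exercise slug, e.g. "acronym"
  passed: boolean;    // assumed: whether the agent's edit made the tests pass
  durationMs: number; // assumed: wall-clock time for the run
}

const results: ExerciseResult[] = JSON.parse(await Bun.file("results.json").text());

const solved = results.filter((r) => r.passed).length;
const successRate = (100 * solved) / results.length;                            // e.g. 24/25 -> 96.0%
const avgSeconds = results.reduce((s, r) => s + r.durationMs, 0) / results.length / 1000;

console.log(`${solved}/${results.length} solved (${successRate.toFixed(1)}%), avg ${avgSeconds.toFixed(1)}s`);
```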

Documentation

For detailed documentation, see:

🚀 Getting Started

Installation

bun install

Usage

Run the benchmark with the following commands. Use --help to see all available options.

# Run the default 25 problems with Claude Code (Sonnet 3.5)
bun src/index.ts --agent claude --model claude-3-5-sonnet-20240620

# Run only the 'acronym' problem with Aider (GPT-4o)
bun src/index.ts --agent aider --model gpt-4o --exercise acronym
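
Other agent/model pairs from the leaderboard can be run the same way, assuming the same --agent and --model flags apply; for example, the current top leaderboard entry:

# Run the default problem set with opencode (GPT-5)
bun src/index.ts --agent opencode --model openai/gpt-5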
