
BIRD-CRITIC-1
[NeurIPS 2025 Main] SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
Stars: 769

BIRD-CRITIC 1.0 is a SQL benchmark designed to evaluate the capability of large language models (LLMs) in diagnosing and solving user issues within real-world database environments. It comprises 600 tasks for development and 200 held-out out-of-distribution tests across 4 prominent open-source SQL dialects. The benchmark expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. An optimized execution-based evaluation environment is included for rigorous and efficient validation.
README:
- π’ [2025-07-09] We release the human performance scores on our website! The scores displayed across all three leaderboards reflect human evaluators (database experts) who were allowed to use standard tools (database textbooks, official documentation, or IDEs) but not AI assistants. When another group with the same expertise was permitted to use AI tools (ChatGPT, Claude, or Gemini), the performance increased to 83.33 on Open, 87.90 on PG, and 90.00 on Flash, demonstrating the significant potential of human-AI collaboration in SQL problem-solving.
- π’ [2025-06-28] We release our paper SWE-SQL (a.k.a BIRD-CRITIC) on arxiv.
- π’ [2025-06-09] We release bird-interact-lite, feature multi-turn conversational and agentic interaction for real-world ambiguous and challenging text-to-SQL tasks.
- π’ [2025-06-08] We release bird-critic-1.0-postgresql, a single-dialect SQL issue debugging set with 530 complex tasks.
- π’ [2025-05-30] We are pleased to release LiveSQLBench-Base-Lite, featuring 18 end-user level databases and 270 tasks (180 SELECT-only, 90 Management tasks). Each task involves unambiguous and straightforward user queries grounded in external knowledge, with medium to hard complexity SQL statements.
BIRD-Critic 1.0 introduces a novel SQL benchmark designed to evaluate a key capability: Can large language models (LLMs) diagnose and solve user issues within real-world database environments?
The benchmark comprises 600 tasks for development and 200 held-out out-of-distribution (OOD) tests. BIRD-CRITIC 1.0 is built on realistic user issues across 4 prominent open-source SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. Finally, an optimized execution-based evaluation environment is included for rigorous and efficient validation.
Each task in BIRD-CRITIC has been verified by human experts on the following dimensions:
- Reproduction of errors on the BIRD environment to prevent data leakage.
- Carefully curated test case functions for each task specifically.
- Soft EX: This metric can evaluate SELECT-ONLY tasks.
- Soft EX + Parsing: This metric can evaluate tasks with user-specific requirements or refinements.
- Test Case: For DBA tasks, such as CRUD (CREATE, READ, UPDATE, DELETE), test cases are designed to evaluate the correctness of the logic. This is also effective for user issues requiring multiple sequential SQL queries to resolve.
- Query Execution Plan: For user tasks involving efficiency improvement or runtime errors, QEP (Query Execution Plan) can be used to evaluate solution SQL queries at the algorithm level.
- Fast Eval Sandbox via PostgreSQL template & docker.
- Created new RDBs in different scales and professional domains.
We are releasing a lite version of BIRD-Critic, bird-critic-1.0-flash-exp
, which includes 200 high-quality user issues on PostgreSQL when developing real-world applications. We curate tasks by:
- Collecting and understanding realistic user issues.
- Distilling problem definitions and SQL knowledge.
- Reproducing bugs and solutions in the BIRD environment.
- Designing test cases for evaluation.
The open version of BIRD-CRITIC 1.0, bird-critic-1.0-open
, is a comprehensive benchmark that includes 570 tasks across 4 SQL dialects: MySQL, PostgreSQL, SQL Server, and Oracle. It covers a wide range of SQL operations and user issues.
Rank | Model Name | Score | Level |
---|---|---|---|
1 | o3-mini-2025-01-31 | 34.50 | π Leading |
2 | deepseek-reasoner (r1) | 33.67 | π Elite |
3 | o1-preview-2024-09-12 | 33.33 | π Elite |
4 | claude-3-7-sonnet-20250219(thinking) | 30.67 | π Elite |
5 | gemini-2.0-flash-thinking-exp-01-21 | 30.17 | π Elite |
6 | grok-3-beta | 29.83 | π Superior |
Complete results of Open version can be found here. Bird-CRITIC 1.0 Flash result can be found here
bird-critic-1.0-pg
, is a dataset containing 530 high-quality user issues focused on real-world PostgreSQL database applications.
Key Findings:
- Agentic workflows significantly improve LLM performance on debugging tasks through iterative environment interaction
- SQLACT consistently outperforms ToolACT across all models, demonstrating the advantage of SQL's richer action space for handling complex debugging scenarios
The BIRD-CRITIC 1.0 benchmark is available in the following configurations:
-
bird-critic-1.0-flash-exp
: A lite version consisting of 200 instances on PostgreSQL. -
bird-critic-1.0-open
: The full version containing 570 instances across MySQL, PostgreSQL, SQL Server, and Oracle. -
bird-critic-1.0-postgresql
: A 530-instance version specifically for PostgreSQL. -
bird-critic-1.0-bigquery
: A lite version containing between 100 and 200 instances for BigQuery.
- Database: The complete database can be download from the Google Drive. Check the Quick Eval section for more details.
-
data: Each data instance contain the following main parts:
-
db_id
: The name of the database. -
query
: The user query is rewritten in the BIRD environment. -
issue_sql
: The buggy SQL query written by the user. -
sol_sql
: The ground truth SQL solution. -
preprocess_sql
: SQL queries to run before executing the solution or prediction. -
clean_up_sql
: SQL queries to run after the test cases to revert any changes made to the database. -
test_cases
: A set of test cases to validate the predicted corrected SQL. -
efficiency
: True if this question needs optimization, measure the cost by Query Execution Plan (QEP) -
external_data
: For the external JSON data if present
-
-
baseline: The baseline code is available in the
./baseline
directory. -
evaluation: The evaluation code is available in the
./evaluation
directory. - Curated by: BIRD Team & Google Cloud
- License: cc-by-sa-4.0
- HuggingFace Dataset Card: bird-critic-1.0-flash-exp
To avoid data leakage by auto-crawling, we do not include GT solution sqls and test cases along with data. please email [email protected] or [email protected] for full set, which will be sent automatically.
You can download the dataset from HuggingFace using the following command:
from datasets import load_dataset
# Load the flash version of the dataset
dataset = load_dataset("birdsql/bird-critic-1.0-flash-exp")
print(dataset["flash"][0])
# Load the open version of the dataset
dataset = load_dataset("birdsql/bird-critic-1.0-open")
print(dataset["open"][0])
Or you can use the provided script to download the open version of the dataset and split it into different dialects.
cd baseline/data
python pull_data.py \
--schema_path path/to/open_schema.jsonl \
--input_path path/to/input.jsonl \ # Path to the input JSONL file (may be empty if you want to download the dataset from HuggingFace)
--output_folder path/to/output_dir # output folder of the split files
.
βββ LICENSE
βββ README.md
βββ baseline
βΒ Β βββ data
βΒ Β βββ outputs
βΒ Β βββ run
βΒ Β βββ src
βββ evaluation
βΒ Β βββ docker-compose.yml
βΒ Β βββ env
βΒ Β βββ mssql_table_dumps
βΒ Β βββ mysql_table_dumps
βΒ Β βββ oracle_table_dumps
βΒ Β βββ postgre_table_dumps
βΒ Β βββ run
βΒ Β βββ src
βββ materials
βΒ Β βββ ...
βββ requirements.txt
To run the baseline code you need to install the following dependencies:
conda create -n bird_critic python=3.10 -y
conda activate bird_critic
pip install -r requirements.txt
You also need to setup the model name (eg., gpt-4o-2024-08-06) with the API key in the config.py
file. Then you can run the following command to generate the output:
# Generate the prompt
cd baseline/run
bash generate_prompt.sh
# LLM Inference, need to set the API key in config.py
bash run_baseline.sh
The output will be save in the ./baseline/outputs/final_output/
We use docker to provide a consistent environment for running the benchmark. To set up the environment, follow these steps:
- First download the PostgreSQL, MySQL, SQL Server and Oracle database from the Google Drive.
- Unzip the folder and save it in the
./evaluation
named with postgre_table_dumps,mssql_table_dumps, mysql_table_dumps and oracle_table_dumps. - Build the docker compose
cd evaluation
docker compose up --build
- Interact with the database
You can use the
perform_query_on_{dialect}_databases()
function in theevaluation/src/{dialect}_utils.py
file to interact with the each database. The function will return the result of the query. - Run the evaluation script inside the so_eval_env container
docker compose exec so_eval_env bash
cd run
bash run_eval.sh
You have to specify the dialect you want to evaluate in the run_eval.sh
script. The options are:
postgresql
mysql
sqlserver
-
oracle
The output report file will be saved in the same folder as your input file. If you want the log file for each instance, you can set the--logging
totrue
in therun_eval.sh
script.
If you find our work helpful, please cite as:
@article{li2025swe,
title={SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications},
author={Li, Jinyang and Li, Xiaolong and Qu, Ge and Jacobsson, Per and Qin, Bowen and Hui, Binyuan and Si, Shuzheng and Huo, Nan and Xu, Xiaohan and Zhang, Yue and others},
journal={arXiv preprint arXiv:2506.18951},
year={2025}
}
- [x] Release lite version, bird-critic-1.0-flash (200).
- [x] Open source code, leaderboard page.
- [x] Release Full bird-critic-1.0-open (570 w/ 4 dialects).
- [x] Release Full bird-critic-1.0-postgresql (530 pg tasks).
- [ ] Release SIX-GYM (Sql-fIX), with 2000+ gym-like training environment.
- [ ] Release trained agentic baseline BIRD-Fixer.
- [ ] Update Agentic (SQL-Act) Baseline.
BIRD Team & Google Cloud
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for BIRD-CRITIC-1
Similar Open Source Tools

BIRD-CRITIC-1
BIRD-CRITIC 1.0 is a SQL benchmark designed to evaluate the capability of large language models (LLMs) in diagnosing and solving user issues within real-world database environments. It comprises 600 tasks for development and 200 held-out out-of-distribution tests across 4 prominent open-source SQL dialects. The benchmark expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. An optimized execution-based evaluation environment is included for rigorous and efficient validation.

AIOS
AIOS, a Large Language Model (LLM) Agent operating system, embeds large language model into Operating Systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.

sec-parser
The `sec-parser` project simplifies extracting meaningful information from SEC EDGAR HTML documents by organizing them into semantic elements and a tree structure. It helps in parsing SEC filings for financial and regulatory analysis, analytics and data science, AI and machine learning, causal AI, and large language models. The tool is especially beneficial for AI, ML, and LLM applications by streamlining data pre-processing and feature extraction.

NExT-GPT
NExT-GPT is an end-to-end multimodal large language model that can process input and generate output in various combinations of text, image, video, and audio. It leverages existing pre-trained models and diffusion models with end-to-end instruction tuning. The repository contains code, data, and model weights for NExT-GPT, allowing users to work with different modalities and perform tasks like encoding, understanding, reasoning, and generating multimodal content.

Biomni
Biomni is a general-purpose biomedical AI agent designed to autonomously execute a wide range of research tasks across diverse biomedical subfields. By integrating cutting-edge large language model (LLM) reasoning with retrieval-augmented planning and code-based execution, Biomni helps scientists dramatically enhance research productivity and generate testable hypotheses.

resume-job-matcher
Resume Job Matcher is a Python script that automates the process of matching resumes to a job description using AI. It leverages the Anthropic Claude API or OpenAI's GPT API to analyze resumes and provide a match score along with personalized email responses for candidates. The tool offers comprehensive resume processing, advanced AI-powered analysis, in-depth evaluation & scoring, comprehensive analytics & reporting, enhanced candidate profiling, and robust system management. Users can customize font presets, generate PDF versions of unified resumes, adjust logging level, change scoring model, modify AI provider, and adjust AI model. The final score for each resume is calculated based on AI-generated match score and resume quality score, ensuring content relevance and presentation quality are considered. Troubleshooting tips, best practices, contribution guidelines, and required Python packages are provided.

RooFlow
RooFlow is a VS Code extension that enhances AI-assisted development by providing persistent project context and optimized mode interactions. It reduces token consumption and streamlines workflow by integrating Architect, Code, Test, Debug, and Ask modes. The tool simplifies setup, offers real-time updates, and provides clearer instructions through YAML-based rule files. It includes components like Memory Bank, System Prompts, VS Code Integration, and Real-time Updates. Users can install RooFlow by downloading specific files, placing them in the project structure, and running an insert-variables script. They can then start a chat, select a mode, interact with Roo, and use the 'Update Memory Bank' command for synchronization. The Memory Bank structure includes files for active context, decision log, product context, progress tracking, and system patterns. RooFlow features persistent context, real-time updates, mode collaboration, and reduced token consumption.

fastapi-langgraph-agent-production-ready-template
A production-ready FastAPI template for building AI agent applications with LangGraph integration. This template provides a robust foundation for building scalable, secure, and maintainable AI agent services. It includes features like FastAPI for high-performance async API endpoints, LangGraph integration, structured logging, rate limiting, PostgreSQL for data persistence, Docker support, security measures like JWT-based authentication and input sanitization, developer-friendly features like environment-specific configuration and type hints, a model evaluation framework with automated metric-based evaluation and detailed JSON reports, and a configuration system with environment-specific settings.

RainbowGPT
RainbowGPT is a versatile tool that offers a range of functionalities, including Stock Analysis for financial decision-making, MySQL Management for database navigation, and integration of AI technologies like GPT-4 and ChatGlm3. It provides a user-friendly interface suitable for all skill levels, ensuring seamless information flow and continuous expansion of emerging technologies. The tool enhances adaptability, creativity, and insight, making it a valuable asset for various projects and tasks.

mmore
MMORE is an open-source, end-to-end pipeline for ingesting, processing, indexing, and retrieving knowledge from various file types such as PDFs, Office docs, images, audio, video, and web pages. It standardizes content into a unified multimodal format, supports distributed CPU/GPU processing, and offers hybrid dense+sparse retrieval with an integrated RAG service through CLI and APIs.

extension-gen-ai
The Looker GenAI Extension provides code examples and resources for building a Looker Extension that integrates with Vertex AI Large Language Models (LLMs). Users can leverage the power of LLMs to enhance data exploration and analysis within Looker. The extension offers generative explore functionality to ask natural language questions about data and generative insights on dashboards to analyze data by asking questions. It leverages components like BQML Remote Models, BQML Remote UDF with Vertex AI, and Custom Fine Tune Model for different integration options. Deployment involves setting up infrastructure with Terraform and deploying the Looker Extension by creating a Looker project, copying extension files, configuring BigQuery connection, connecting to Git, and testing the extension. Users can save example prompts and configure user settings for the extension. Development of the Looker Extension environment includes installing dependencies, starting the development server, and building for production.

ragflow
RAGFlow is an open-source Retrieval-Augmented Generation (RAG) engine that combines deep document understanding with Large Language Models (LLMs) to provide accurate question-answering capabilities. It offers a streamlined RAG workflow for businesses of all sizes, enabling them to extract knowledge from unstructured data in various formats, including Word documents, slides, Excel files, images, and more. RAGFlow's key features include deep document understanding, template-based chunking, grounded citations with reduced hallucinations, compatibility with heterogeneous data sources, and an automated and effortless RAG workflow. It supports multiple recall paired with fused re-ranking, configurable LLMs and embedding models, and intuitive APIs for seamless integration with business applications.

Curie
Curie is an AI-agent framework designed for automated and rigorous scientific experimentation. It automates end-to-end workflow management, ensures methodical procedure, reliability, and interpretability, and supports ML research, system analysis, and scientific discovery. It provides a benchmark with questions from 4 Computer Science domains. Users can customize experiment agents and adapt to their own tasks by configuring base_config.json. Curie is suitable for hyperparameter tuning, algorithm behavior analysis, system performance benchmarking, and automating computational simulations.

EasyInstruct
EasyInstruct is a Python package proposed as an easy-to-use instruction processing framework for Large Language Models (LLMs) like GPT-4, LLaMA, ChatGLM in your research experiments. EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.

llm_aided_ocr
The LLM-Aided OCR Project is an advanced system that enhances Optical Character Recognition (OCR) output by leveraging natural language processing techniques and large language models. It offers features like PDF to image conversion, OCR using Tesseract, error correction using LLMs, smart text chunking, markdown formatting, duplicate content removal, quality assessment, support for local and cloud-based LLMs, asynchronous processing, detailed logging, and GPU acceleration. The project provides detailed technical overview, text processing pipeline, LLM integration, token management, quality assessment, logging, configuration, and customization. It requires Python 3.12+, Tesseract OCR engine, PDF2Image library, PyTesseract, and optional OpenAI or Anthropic API support for cloud-based LLMs. The installation process involves setting up the project, installing dependencies, and configuring environment variables. Users can place a PDF file in the project directory, update input file path, and run the script to generate post-processed text. The project optimizes processing with concurrent processing, context preservation, and adaptive token management. Configuration settings include choosing between local or API-based LLMs, selecting API provider, specifying models, and setting context size for local LLMs. Output files include raw OCR output and LLM-corrected text. Limitations include performance dependency on LLM quality and time-consuming processing for large documents.

AutoAgent
AutoAgent is a fully-automated and zero-code framework that enables users to create and deploy LLM agents through natural language alone. It is a top performer on the GAIA Benchmark, equipped with a native self-managing vector database, and allows for easy creation of tools, agents, and workflows without any coding. AutoAgent seamlessly integrates with a wide range of LLMs and supports both function-calling and ReAct interaction modes. It is designed to be dynamic, extensible, customized, and lightweight, serving as a personal AI assistant.
For similar tasks

BIRD-CRITIC-1
BIRD-CRITIC 1.0 is a SQL benchmark designed to evaluate the capability of large language models (LLMs) in diagnosing and solving user issues within real-world database environments. It comprises 600 tasks for development and 200 held-out out-of-distribution tests across 4 prominent open-source SQL dialects. The benchmark expands beyond simple SELECT queries to cover a wider range of SQL operations, reflecting actual application scenarios. An optimized execution-based evaluation environment is included for rigorous and efficient validation.

rpaframework
RPA Framework is an open-source collection of libraries and tools for Robotic Process Automation (RPA), designed to be used with Robot Framework and Python. It offers well-documented core libraries for Software Robot Developers, optimized for Robocorp Control Room and Developer Tools, and accepts external contributions. The project includes various libraries for tasks like archiving, browser automation, date/time manipulations, cloud services integration, encryption operations, database interactions, desktop automation, document processing, email operations, Excel manipulation, file system operations, FTP interactions, web API interactions, image manipulation, AI services, and more. The development of the repository is Python-based and requires Python version 3.8+, with tooling based on poetry and invoke for compiling, building, and running the package. The project is licensed under the Apache License 2.0.

Delphi-AI-Developer
Delphi AI Developer is a plugin that enhances the Delphi IDE with AI capabilities from OpenAI, Gemini, and Groq APIs. It assists in code generation, refactoring, and speeding up development by providing code suggestions and predefined questions. Users can interact with AI chat and databases within the IDE, customize settings, and access documentation. The plugin is open-source and under the MIT License.

awesome-mcp-servers
A curated list of awesome Model Context Protocol (MCP) servers that enable AI models to securely interact with local and remote resources through standardized server implementations. The list focuses on production-ready and experimental servers extending AI capabilities through file access, database connections, API integrations, and other contextual services.
For similar jobs

lollms-webui
LoLLMs WebUI (Lord of Large Language Multimodal Systems: One tool to rule them all) is a user-friendly interface to access and utilize various LLM (Large Language Models) and other AI models for a wide range of tasks. With over 500 AI expert conditionings across diverse domains and more than 2500 fine tuned models over multiple domains, LoLLMs WebUI provides an immediate resource for any problem, from car repair to coding assistance, legal matters, medical diagnosis, entertainment, and more. The easy-to-use UI with light and dark mode options, integration with GitHub repository, support for different personalities, and features like thumb up/down rating, copy, edit, and remove messages, local database storage, search, export, and delete multiple discussions, make LoLLMs WebUI a powerful and versatile tool.

Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customerβs subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

minio
MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.

mage-ai
Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.

AiTreasureBox
AiTreasureBox is a versatile AI tool that provides a collection of pre-trained models and algorithms for various machine learning tasks. It simplifies the process of implementing AI solutions by offering ready-to-use components that can be easily integrated into projects. With AiTreasureBox, users can quickly prototype and deploy AI applications without the need for extensive knowledge in machine learning or deep learning. The tool covers a wide range of tasks such as image classification, text generation, sentiment analysis, object detection, and more. It is designed to be user-friendly and accessible to both beginners and experienced developers, making AI development more efficient and accessible to a wider audience.

tidb
TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

airbyte
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.

labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.