
CrackSQL
LLM-based Dialect Translation System
Stars: 63

CrackSQL is a powerful SQL dialect translation tool that integrates rule-based strategies with large language models (LLMs) for high accuracy. It enables seamless conversion between dialects (e.g., PostgreSQL ↔ MySQL) with flexible access through a Python API, the command line, and a web interface. The tool supports extensive dialect compatibility, precision & advanced processing, and versatile access & integration. It offers three modes for dialect translation and demonstrates high translation accuracy over collected benchmarks. Users can deploy CrackSQL via PyPI package installation or source-code installation. The tool can be extended to support additional syntax and new dialects and to improve translation efficiency. The project is actively maintained and welcomes contributions from the community.
README:
Unlock seamless SQL translation: effortless, precise, and efficient across databases!
Demo • Quick Start • Feature Extension • FAQ • Community • Contributors • License
English | 简体中文
Star and subscribe for the latest features and improvements!
CrackSQL is a powerful SQL dialect translation tool that integrates rule-based strategies with LLMs for high accuracy. It enables seamless conversion between dialects (e.g., PostgreSQL ↔ MySQL) with flexible access through a Python API, the command line, and a web interface.
- 09/2025: Our newly-developed benchmark PARROT (i.e., the first SQL-to-SQL benchmark) has been accepted by NeurIPS 2025! Submissions to challenge our leaderboard are welcome!
- 06/2025: We were invited by the Ploutos community to present CrackSQL at their livestream event!
- 04/2025: We have released the demo paper for this project, which can be found online!
- 03/2025: We have refactored the code and released the project across multiple open-source platforms (PyPI). We are currently working on new features, and new contributors are welcome!
- 02/2025: Our paper "Cracking SQL Barriers: An LLM-based Dialect Translation System" has been accepted by SIGMOD 2025!
- Extensive Dialect Compatibility: Effortlessly translates among PostgreSQL, MySQL, and Oracle with tailored, flexible strategies.
- Precision & Advanced Processing: Achieves highly accurate translations with function-oriented query handling and model-based syntax matching through an adaptive local-to-global iteration strategy.
- Versatile Access & Integration: Accessible through a Python API, a command-line interface, and a web interface to meet different user requirements.
Currently, CrackSQL integrates three modes for dialect translation, adopting the rules from SQLGlot and supporting a wide range of large language models (LLMs), including prevalent models such as GPT as well as the recent DeepSeek.
Both the LLM and the embedding model may be used with or without fine-tuning:

| Mode | SQL Dialect | LLM: Cloud Service (e.g., GPT series) | LLM: Locally Deployed (e.g., Hugging Face) | Embedding Model: Cloud Service | Embedding Model: Locally Deployed |
|---|---|---|---|---|---|
| Rule-only | 24 | - | - | - | - |
| LLM-direct | ✓ | ✓ | ✓ | - | - |
| Rule+LLM | 3 (pg/mysql/oracle) | ✓ | ✓ | ✓ | ✓ |
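The Rule-only mode adopts its rewrite rules from SQLGlot. For reference, a minimal standalone SQLGlot transpilation (plain SQLGlot usage, not the CrackSQL API) looks like the sketch below:

```python
# Plain SQLGlot usage for rule-based transpilation (for reference only;
# CrackSQL's Rule-only mode builds on rules adopted from SQLGlot).
import sqlglot

translated = sqlglot.transpile(
    "SELECT EXTRACT(YEAR FROM CURRENT_TIMESTAMP)",
    read="postgres",  # source dialect
    write="mysql",    # target dialect
)[0]
print(translated)  # the MySQL rendering of the query
```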
Additionally, the prerequisites for each mode are listed below, where the SQL parser (ANTLR) and the dialect specifications have already been provided. Please refer to the Feature Extension section to customize and extend CrackSQL for your own cases.
| Mode | SQL Parser | Dialect Specification | Database Connection | LLM | Embedding Model |
|---|---|---|---|---|---|
| Rule-only | ✓ (SQLGlot) | - | - | - | - |
| LLM-direct | - | - | - | ✓ | - |
| Rule+LLM | ✓ (ANTLR) | ✓ | ✓ / - | ✓ | ✓ / - |
The following table demonstrates the translation accuracy (%) of different methods over our collected benchmark (N/A denotes that the dialect translation is not supported by Ora2Pg).
- (1) $Acc_{EX}$ indicates the translated SQL is syntactically correct and executable on the target database.
- (2) $Acc_{RES}$ indicates the translated SQL delivers exactly the same result (including the displayed order) as the original one.
Note that the required translation duration depends heavily on SQL complexity (e.g., the number of SQL syntax pieces to be translated) and can vary from several seconds to minutes.
| Method | PG → MySQL ($Acc_{EX}$) | PG → MySQL ($Acc_{RES}$) | MySQL → PG ($Acc_{EX}$) | MySQL → PG ($Acc_{RES}$) | PG → Oracle ($Acc_{EX}$) | PG → Oracle ($Acc_{RES}$) |
|---|---|---|---|---|---|---|
| SQLGlot | 74.19 | 70.97 | 60.32 | 60.32 | 55.81 | 53.49 |
| jOOQ | 70.97 | 70.97 | 39.68 | 39.68 | 62.79 | 60.47 |
| Ora2Pg | N/A | N/A | 33.33 | 33.33 | N/A | N/A |
| SQLines | 9.68 | 9.68 | 31.75 | 31.75 | 53.49 | 48.84 |
| GPT-4o | 61.29 | 61.29 | 50.79 | 44.44 | 60.47 | 55.81 |
| CrackSQL (Ours) | 87.10 | 74.19 | 85.71 | 79.37 | 69.77 | 67.44 |
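For clarity, here is a minimal sketch of how $Acc_{EX}$ and $Acc_{RES}$ could be computed over such a benchmark. `run_query` is a hypothetical helper (not part of CrackSQL) that executes a statement on the target database and returns the result rows in display order, raising an exception if the statement is not executable:

```python
# Hypothetical evaluation sketch for Acc_EX / Acc_RES over
# (translated_sql, reference_sql) pairs; reference queries are assumed valid.
def accuracy(pairs, run_query):
    executable = matching = 0
    for translated_sql, reference_sql in pairs:
        try:
            result = run_query(translated_sql)   # raises if not executable
        except Exception:
            continue                             # counts against both metrics
        executable += 1
        if result == run_query(reference_sql):   # same rows in the same order
            matching += 1
    total = len(pairs)
    return executable / total, matching / total  # (Acc_EX, Acc_RES)
```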
The following showcases the primary pages of the CrackSQL web interface, including the service guidance homepage and the detailed translation process.
- Homepage of the deployed translation service:
- Detailed translation process of a specific translation pair:
We currently offer two methods (i.e., PyPI package and source-code installation) to deploy CrackSQL.
- Install the PyPI package from the official website:
```bash
# create virtual environment
conda create -n CrackSQL python=3.10
conda activate CrackSQL

# install PyPI package
pip install cracksql
```
- Run with the PyPI package. A runnable example using the package is presented below:
```python
from cracksql.cracksql import translate, initkb


def initkb_func():
    try:
        # fill in the basic configurations in the `.yaml` file first
        initkb("./init_config.yaml")
        print("Knowledge base initialized successfully")
    except Exception as e:
        print(f"Knowledge base initialization failed: {str(e)}")
        import traceback
        traceback.print_exc()


def trans_func():
    target_db_config = {
        "host": "target database host",
        "port": 3306,  # target database port (int type)
        "user": "target database username",
        "password": "target database password",
        "db_name": "target database name"
    }
    vector_config = {
        "src_kb_name": "source database knowledge base name",
        "tgt_kb_name": "target database knowledge base name"
    }
    try:
        print("Starting SQL translation...")
        translated_sql, model_ans_list, used_pieces, lift_histories = translate(
            model_name="DeepSeek-R1-Distill-Qwen-32B",
            src_sql='SELECT DISTINCT "t1"."id" , EXTRACT(YEAR FROM CURRENT_TIMESTAMP) - EXTRACT(YEAR FROM CAST( "t1"."birthday" AS TIMESTAMP )) FROM "patient" AS "t1" INNER JOIN "examination" AS "t2" ON "t1"."id" = "t2"."id" WHERE "t2"."rvvt" = "+"',
            src_dialect="postgresql",
            tgt_dialect="mysql",
            target_db_config=target_db_config,
            vector_config=vector_config,
            out_dir="./",
            retrieval_on=False,
            top_k=3
        )
        print("Translation completed!")
        print(f"Translated SQL: {translated_sql}")
        print(f"Model answer list: {model_ans_list}")
        print(f"Used knowledge pieces: {used_pieces}")
        print(f"Lift histories: {lift_histories}")
    except Exception as e:
        print(f"Error occurred during translation: {str(e)}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    initkb_func()
    trans_func()
```
- Install from the source code and launch the web service:

```bash
git clone https://github.com/weAIDB/CrackSQL.git

# Start backend
cd CrackSQL/backend

# Install dependencies
conda create -n CrackSQL python=3.10
conda activate CrackSQL
pip install -r requirements.txt

# Initialize database
flask db init     # initialize
flask db migrate  # generate version file
flask db upgrade  # synchronize to database

# Initialize knowledge base (optional; can also be done manually in the frontend after it starts)
# 1. First rename config/init_config.yaml.copy to config/init_config.yaml
# 2. Modify the relevant information in config/init_config.yaml;
#    an embedding model is required if you want to initialize the knowledge base
python init_knowledge_base.py --config_file xxxxxxxxx

# Start backend service (the backend port can be modified in app.py, currently 30006)
python app.py

# Open a new terminal window and start the frontend (requires Node.js 20.11.1+)
cd CrackSQL/webui

# Install dependencies
pnpm install

# Start development server
pnpm run dev

# Visit http://localhost:50212 to use the web interface
#
# Tips:
# - To change the frontend port, edit `port: 50212` in webui/vite.config.js
# - If the backend API port has been changed, or you want to use the server's IP,
#   modify VITE_APP_BASE_URL in webui/.env.serve-dev (if the file does not exist,
#   rename webui/.env.serve-dev_template to .env.serve-dev)
```

- Translate via the command line:

```bash
# Initialize knowledge base (optional; can also be done manually in the frontend)
# 1. First rename config/init_config.yaml.copy to config/init_config.yaml
# 2. Modify the relevant information in config/init_config.yaml;
#    an embedding model is required if you want to initialize the knowledge base
python init_knowledge_base.py --init_all

# Translate: specify the required configurations displayed by the `--help` command
python translate.py --help
```
To support additional syntax, you can modify the `.g4` files in ANTLR according to the grammar rules shown below. In this grammar, each parsing rule is structured recursively and consists of both non-terminal and terminal tokens. Once your `.g4` files are prepared, you can use the official ANTLR tool to generate an updated Python parser for integration into CrackSQL.
```antlr
sql_script
    : sql_plus_command_no_semicolon? (
        (sql_plus_command | unit_statement) (SEMICOLON '/'? (sql_plus_command | unit_statement))* SEMICOLON? '/'?
    ) EOF
    ;

// ......
```
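Once the grammar is ready, the parser generation mentioned above is typically a single ANTLR invocation; the grammar file name below is hypothetical:

```bash
# Generate a Python 3 parser from a (hypothetical) grammar file MyDialect.g4.
# The `antlr4` command is provided by antlr4-tools (`pip install antlr4-tools`);
# the classic `java -jar antlr-*-complete.jar` invocation works the same way.
antlr4 -Dlanguage=Python3 MyDialect.g4 -o generated/
```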
To add further dialect specifications, you can append new entries to a `.json` file in the following format.
```json
[
    {
        "keyword": "the SQL snippet, REQUIRED",
        "type": "function/keyword/type/operator, REQUIRED",
        "tree": "syntax tree generated by SQL parser, REQUIRED",
        "description": "brief usage description, REQUIRED",
        "detail": "detailed usage illustration, REQUIRED (empty string if None)",
        "link": ["link1", "link2", "link3"],
        "example": ["example1", "example2", "example3"]
    },
    {
        ......
    }
]
```
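As a sanity check before loading such a file, a small illustrative validator (not part of CrackSQL; the field names simply mirror the format above) might look like this:

```python
# Illustrative validator for a dialect-specification .json file (not CrackSQL code).
import json
import sys

REQUIRED_FIELDS = {"keyword", "type", "tree", "description", "detail"}
ALLOWED_TYPES = {"function", "keyword", "type", "operator"}

def check_spec(path):
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for i, item in enumerate(items):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            print(f"item {i}: missing required fields {sorted(missing)}")
        if item.get("type") not in ALLOWED_TYPES:
            print(f"item {i}: unexpected type {item.get('type')!r}")

if __name__ == "__main__":
    check_spec(sys.argv[1])
```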
Enabling CrackSQL to support new dialects requires two key components: (1) a dialect syntax parser and (2) functionality specifications.
You can start by checking the official ANTLR repository to see whether the desired dialect grammar (i.e., ANTLR `.g4` files) is already available. If the required grammar does not exist, you need to compose the corresponding ANTLR grammar files to build the SQL syntax parser. Once the `.g4` files are ready, you can use the official ANTLR tool to generate an updated Python parser, which can then be integrated into CrackSQL.
You also need to transform the functionality specifications (e.g., the Oracle function descriptions) into a `.json` file, in which each item is organized according to the following format.
```json
[
    {
        "keyword": "the SQL snippet, REQUIRED",
        "type": "snippet type from four options: '(1) function, or (2) keyword, or (3) type, or (4) operator', REQUIRED",
        "tree": "syntax tree generated by SQL parser, REQUIRED",
        "description": "brief usage description, REQUIRED",
        "detail": "detailed usage illustration, REQUIRED (empty string if None)",
        "link": ["link1", "link2", "link3"],
        "example": ["example1", "example2", "example3"]
    },
    {
        ......
    }
]
```
Q: How can CrackSQL support additional syntax or a new dialect?
A: To support additional syntax, you need to modify the `.g4` files in ANTLR and then generate an updated Python parser. Moreover, you should provide the corresponding dialect specifications for the newly-added syntax. To support a new dialect, two key components (i.e., a dialect syntax parser and functionality specifications) are required. Currently, the syntax parser is created based on ANTLR grammars, and the specifications can be derived by processing official documents.
For more detailed information, please refer to the Feature Extension section.
- Effective Artifact Release: We are preparing our MoE-based cross-dialect embedding models for practical use and intend to release them on Hugging Face soon.
- Comprehensive Dialect Support: We will support more dialects with prepared syntax parsers and functionality specifications; this is a long-term effort, and more contributors are welcome!
- Translation Efficiency Improvement: We aim to implement the rules discovered by LLMs in rule systems, thus reducing the LLM invocation overhead.
We deeply appreciate the invaluable effort contributed by our dedicated team of developers, supportive users, and esteemed industry partners.
If you like this project, please cite our paper:
```bibtex
@article{zhou2025cracksql,
  author  = {Wei Zhou and Yuyang Gao and Xuanhe Zhou and Guoliang Li},
  title   = {Cracking {SQL} Barriers: An {LLM}-based Dialect Translation System},
  journal = {Proc. {ACM} Manag. Data},
  volume  = {3},
  number  = {3 (SIGMOD)},
  year    = {2025}
}

@article{zhou2025cracksqldemo,
  author  = {Wei Zhou and Yuyang Gao and Xuanhe Zhou and Guoliang Li},
  title   = {CrackSQL: A Hybrid {SQL} Dialect Translation System Powered by Large Language Models},
  journal = {arXiv preprint},
  url     = {https://arxiv.org/abs/2504.00882},
  year    = {2025}
}
```
This project is licensed under the MIT License; see the LICENSE file for details.