awesome-ai4db-paper

Papers related to AI4DB techniques

Stars: 53

The 'awesome-ai4db-paper' repository is a curated paper list focusing on AI for database (AI4DB) theory, frameworks, resources, and tools for data engineers. It collects research papers on learning-based query optimization, training data set preparation, cardinality estimation, query-driven approaches, data-driven techniques, hybrid methods, pretrained models, plan hints, cost models, SQL embedding, join order optimization, query rewriting, end-to-end systems, text-to-SQL conversion, traditional database technologies, storage solutions, learning-based index design, and learning-based configuration advisors. The repository aims to be a comprehensive resource for anyone interested in applying AI to database management.

README:

AI4DB Paper Sets 666

Introduction

A curated paper list of awesome AI4DB theory, frameworks, resources, tools and other awesomeness, for data engineers.

Contributing

This repository is under construction. Pull requests are welcome; please follow the entry format below:

paperName (with PDF link) [MeetingName Year] GitHub link, if the code is open-sourced (optional)

Acknowledgements

Thanks to all authors of the papers and repositories cited here :D

Table of Contents

Learning-based Query Optimization

  1. LEON: A New Framework for ML-Aided Query Optimization [VLDB 23]
  2. LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans [VLDB 23]
  3. Eraser: Eliminating Performance Regression on Learned Query Optimizer [VLDB 24]
  4. AutoSteer: Learned Query Optimization for Any SQL Database [VLDB 24]
  5. Modeling Shifting Workloads for Learned Database Systems [SIGMOD 24]
  6. Stage: Query Execution Time Prediction in Amazon Redshift [SIGMOD 24]
  7. Roq: Robust Query Optimization Based on a Risk-aware Learned Cost Model [arXiv 24]
  8. RobOpt: A Tool for Robust Workload Optimization Based on Uncertainty-Aware Machine Learning [SIGMOD Demo 24]
  9. Towards Exploratory Query Optimization for Template-based SQL Workloads [ICDE 24]

Training Data Set Preparation

  1. DSB: a decision support benchmark for workload-driven and traditional database systems [VLDB 21]
  2. Expand your training limits! generating training data for ml-based data management [VLDB 21]
  3. LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning [SIGMOD 22]
  4. Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems [VLDB 24]

Survey

  1. Machine Unlearning in Learned Databases: An Experimental Analysis [SIGMOD 24]

Cardinality Estimation

Survey

  1. Cardinality Estimation: An Experimental Survey [VLDB 17]
  2. Are We Ready For Learned Cardinality Estimation? [VLDB 21]
  3. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [VLDB 21]
  4. Learned cardinality estimation: A design space exploration and a comparative evaluation [VLDB 22]
  5. Learned Cardinality Estimation: An In-depth Study [SIGMOD 22]
  6. A Comparative Study and Component Analysis of Query Plan Representation Techniques in ML4DB Studies [VLDB 24]

Query-Driven

Single-Table
  1. Selectivity estimation for range predicates using lightweight models [VLDB 19]
  2. Deep learning models for selectivity estimation of multiattribute queries [SIGMOD 20]
Multi-Tables
  1. Learned Cardinalities: Estimating Correlated Joins with Deep Learning [CIDR 19]
  2. An End-to-End Learning-based Cost Estimator [VLDB 19]
  3. Flow-Loss: Learning Cardinality Estimates That Matter [VLDB 21]
  4. Speeding Up End-to-end Query Execution via Learning-based Progressive Cardinality Estimation [SIGMOD 23]
  5. Robust Query Driven Cardinality Estimation under Changing Workloads [VLDB 23]
  6. AutoCE: An Accurate and Efficient Model Advisor for Learned Cardinality Estimation [ICDE 23]
  7. ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation [SIGMOD 24]
  8. Adding Domain Knowledge to Query-Driven Learned Databases [SIGMOD 24]

Data-Driven

Single-Table
  1. Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation [SIGMOD 15]
  2. Deep Unsupervised Cardinality Estimation [VLDB 19]
  3. Quicksel: Quick selectivity learning with mixture models [SIGMOD 20]
  4. Pre-training Summarization Models of Structured Datasets for Cardinality Estimation [VLDB 22]
Multi-Tables
  1. DeepDB: Learn from Data, not from Queries! [VLDB 20]
  2. NeuroCard: One Cardinality Estimator for All Tables [VLDB 21]
  3. FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation [VLDB 21]
  4. BayesCard: Revitilizing Bayesian Frameworks for Cardinality Estimation [arXiv 21]
  5. Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size [arXiv 21]
  6. Fauce: fast and accurate deep ensembles with uncertainty for cardinality estimation [VLDB 21]
  7. FACE: a normalizing flow based cardinality estimator [VLDB 22]
  8. FactorJoin: A New Cardinality Estimation Framework for Join Queries [SIGMOD 22] (Bounded)
  9. Cardinality Estimation of LIKE Predicate Queries using Deep Learning [SIGMOD 25]

Hybrid

  1. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation [SIGMOD 21]

Pretrain

  1. PRICE: A Pretrained Model for Cross-Database Cardinality Estimation [arXiv 24]

Plan Hints

  1. Bao: Making Learned Query Optimization Practical [SIGMOD 21]
  2. COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations [VLDB 23]

Cost Model

  1. Cost-based or Learning-based? A Hybrid Query Optimizer for Query Plan Selection [VLDB 22]
  2. Lero: A Learning-to-Rank Query Optimizer [VLDB 23]
  3. Lero: applying learning-to-rank in query optimizer [VLDBJ 24]

SQL Embedding

  1. PreQR: Pre-training Representation for SQL Understanding [SIGMOD 22]

Join Order

  1. Learning to Optimize Join queries With Deep Reinforcement Learning [SIGMOD 16]
  2. Deep Reinforcement Learning for Join Order Enumeration [arXiv 18]
  3. Reinforcement Learning with Tree-LSTM for Join Order Selection [ICDE 20]

Query Rewrite

  1. A Learned Query Rewrite System using Monte Carlo Tree Search [VLDB 22]

End-to-end System

  1. Modeling Shifting Workloads for Learned Database Systems [SIGMOD 24]
  2. Db2une: Tuning Under Pressure via Deep Learning [VLDB 24]

Text2SQL

  1. The Dawn of Natural Language to SQL: Are We Fully Ready? [VLDB 24]

Database Traditional Technology

Storage

  1. Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine [VLDB 22]
  2. TreeLine: An Update-In-Place Key-Value Store for Modern Storage [VLDB 22]
  3. Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads [SIGMOD 23]
  4. Limousine: Blending Learned and Classical Indexes to Self-Design Larger-than-Memory Cloud Storage Engines [SIGMOD 24]

Learning-based Index Design

Single-dimensional

  1. The Case for Learned Index Structures [SIGMOD 18]
  2. FITing-Tree: A Data-aware Index Structure [SIGMOD 19]
  3. ALEX: An Updatable Adaptive Learned Index [arXiv 20]
  4. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds [VLDB 20]
  5. RadixSpline: a single-pass learned index [aiDM 20]
  6. Why Are Learned Indexes So Effective? [ICML 20]
  7. A Pluggable Learned Index Method via Sampling and Gap Insertion [arXiv 21]
  8. Updatable Learned Index with Precise Positions [VLDB 21]
  9. The next 50 years in database indexing or: the case for automatically generated index structures [VLDB 21]
  10. Tuning Hierarchical Learned Indexes on Disk and Beyond [SIGMOD 22]
  11. APEX: A High-Performance Learned Index on Persistent Memory [VLDB 22]
  12. FINEdex: A Fine-grained Learned Index Scheme for Scalable and Concurrent Memory Systems [VLDB 22]
  13. Are Updatable Learned Indexes Ready? [VLDB 22]
  14. CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm [VLDB 22]
  15. NFL: Robust Learned Index via Distribution Transformation [VLDB 22]
  16. Cutting Learned Index into Pieces: An In-depth Inquiry into Updatable Learned Indexes [ICDE 23]

Multi-dimensional

  1. Learning Multi-dimensional Indexes [SIGMOD 20]
  2. LISA: A Learned Index Structure for Spatial Data [SIGMOD 20]
  3. Effectively Learning Spatial Indices [VLDB 20]
  4. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries [EDBT 20]
  5. Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [VLDB 21]
  6. NEIST: a Neural-Enhanced Index for Spatio-Temporal Queries [TKDE 21]
  7. RW-Tree: A Learned Workload-aware Framework for R-tree Construction [ICDE 22]

Learning-based Configuration Advisor

Index Advisor

  1. The Case for Automatic Database Administration using Deep Reinforcement Learning [arXiv 18]
  2. AI Meets AI: Leveraging Query Executions to Improve Index Recommendations [SIGMOD 19]
  3. Online Index Selection Using Deep Reinforcement Learning for a Cluster Database [ICDEW 20]
  4. SMARTIX: A database indexing agent based on reinforcement learning [Applied Intelligence 20]
  5. Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms [VLDB 20]
  6. An Index Advisor Using Deep Reinforcement Learning [CIKM 20]
  7. Automated Database Indexing Using Model-Free Reinforcement Learning [ICAPS 20]
  8. DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees [ICDE 21]
  9. Index selection for NoSQL database with deep reinforcement learning [Information Sciences 21]
  10. MANTIS: Multiple Type and Attribute Index Selection using Deep Reinforcement Learning [IDEAS 21]
  11. AutoIndex: An Incremental Index Management System for Dynamic Workloads [ICDE 22]
  12. SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning [EDBT 22]
  13. Indexer++: Workload-Aware Online Index Tuning with Transformers and Reinforcement Learning [SAC 22]
  14. Budget-aware Index Tuning with Reinforcement Learning [SIGMOD 22]
  15. ISUM: Efficiently Compressing Large and Complex Workloads for Scalable Index Tuning [SIGMOD 22]
  16. DISTILL: low-overhead data-driven techniques for filtering and costing indexes for scalable index tuning [VLDB 22]
  17. HMAB: Self-Driving Hierarchy of Bandits for Integrated Physical Database Design Tuning [VLDB 22]
  18. SmartIndex: An Index Advisor with Learned Cost Estimator [CIKM 22]
  19. Learned Index Benefits: Machine Learning Based Index Performance Estimation [VLDB 23]
  20. No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable Guarantees [TKDE 23]
  21. IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads [EuroMLSys 24]
  22. Leveraging Dynamic and Heterogeneous Workload Knowledge to Boost the Performance of Index Advisors [PVLDB 24]
  23. Refactoring Index Tuning Process with Benefit Estimation [PVLDB 24]
  24. Breaking It Down: An In-Depth Study of Index Advisors [PVLDB 24]
  25. TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation [ICDE 24]
  26. Automatic Database Index Tuning: A Survey [TKDE 24]
  27. Robustness of Updatable Learning-based Index Advisors against Poisoning Attack [SIGMOD 24]
  28. Wii: Dynamic Budget Reallocation In Index Tuning [SIGMOD 24]
  29. Wred: Workload Reduction for Scalable Index Tuning [SIGMOD 24]
  30. ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges [SIGMOD 24]

Database Self-Tuning

  1. Automatic Database Management System Tuning Through Large-scale Machine Learning [SIGMOD 17]
  2. Deploying a Steered Query Optimizer in Production at Microsoft [SIGMOD 22]
  3. Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data [SIGMOD 23]
  4. AutoSteer: Learned Query Optimization for Any SQL Database [SIGMOD 23]
  5. Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift [SIGMOD 23]

LLM

  1. GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization [VLDB 24]
  2. D-Bot: Database Diagnosis System using Large Language Models [VLDB 24]
  3. LLMTune: Accelerate Database Knob Tuning with Large Language Models [VLDB 24]
  4. ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models [VLDB 24]
  5. A Survey on Large Language Models for Code Generation [arXiv 24]
  6. Fuzz4All: Universal Fuzzing with Large Language Models [ICSE 24]
  7. LLM-PBE: Assessing Data Privacy in Large Language Models [VLDB 24]
  8. Are Large Language Models a Good Replacement of Taxonomies? [VLDB 24]
  9. 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning [SIGMOD 25]
  10. LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency [VLDB 25]
  11. Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models [VLDB 25]