
LLM-on-Tabular-Data-Prediction-Table-Understanding-Data-Generation
Repository for collecting and categorizing papers outlined in our survey paper: "Large Language Models on Tabular Data -- A Survey".

This repository serves as a comprehensive survey on the application of Large Language Models (LLMs) on tabular data, focusing on tasks such as prediction, data generation, and table understanding. It aims to consolidate recent progress in this field by summarizing key techniques, metrics, datasets, models, and optimization approaches. The survey identifies strengths, limitations, unexplored territories, and gaps in the existing literature, providing insights for future research directions. It also offers code and dataset references to empower readers with the necessary tools and knowledge to address challenges in this rapidly evolving domain.
README:
@article{fang2024large,
  title={Large Language Models ({LLM}s) on Tabular Data: Prediction, Generation, and Understanding - A Survey},
  author={Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=IZnrCGF9WI}
}
This repo collects and categorizes papers about large language models (LLMs) on tabular data according to our survey paper, "Large Language Models on Tabular Data -- A Survey". Given the fast pace of development in this field, we will continue to update both the arXiv paper and this repo.
Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.
Figure 1: Overview of LLMs on tabular data: the paper discusses applications of LLMs to prediction, data generation, and table understanding tasks.
Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.
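As a quick illustration of the serialization step in Figure 4, below is a minimal, hypothetical sketch (not taken from any paper's codebase) of turning a table row into a natural-language prompt in the "The &lt;column&gt; is &lt;value&gt;." style used by TabLLM-like few-shot classification methods; the column names and task wording are illustrative assumptions:

```python
# Sketch of row serialization for LLM-based tabular prediction.
# Each column becomes one sentence; the task instruction is appended.
def serialize_row(row: dict) -> str:
    """Turn one table row into a sentence-per-column text description."""
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def build_prompt(row: dict, task: str) -> str:
    """Combine the serialized row with a task instruction for the LLM."""
    return f"{serialize_row(row)}\n{task}\nAnswer:"

# Hypothetical example row and task (not from a real dataset).
row = {"age": 42, "occupation": "engineer", "income": "70k"}
print(build_prompt(row, "Does this person earn more than 50k? Yes or No."))
```

Variations on this template (key-value pairs, markdown tables, LaTeX) are compared in the serialization papers listed below.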
Table of contents:
TABLET: Learning From Instructions For Tabular Data [code]
Language models are weak learners
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks [code]
TabLLM: Few-shot Classification of Tabular Data with Large Language Models [code]
UniPredict: Large Language Models are Universal Tabular Classifiers
Towards Foundation Models for Learning on Tabular Data
Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks Using Multiple Embedding Model for EHR (MEME) [code]
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science
Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [model]
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law
PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting
Large Language Models Are Zero-Shot Time Series Forecasters
TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [code]
MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [code]
CPLLM: Clinical Prediction with Large Language Models [code]
CTRL: Connect Collaborative and Language Model for CTR Prediction
FinGPT: Open-Source Financial Large Language Models [code]
Language Models are Realistic Tabular Data Generators [code]
REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
Generative Table Pre-training Empowers Models for Tabular Prediction [code]
TabuLa: Harnessing Language Models for Tabular Data Synthesis [code]
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes
TabMT: Generating tabular data with masked transformers
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
Graph-to-Text Generation with Dynamic Structure Pruning
Plan-then-Seam: Towards Efficient Table-to-Text Generation
Differentially Private Tabular Data Synthesis using Large Language Models
Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]
PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]
Large Language Models are few(1)-shot Table Reasoners [code]
cTBLS: Augmenting Large Language Models with Conversational Tables [code]
Large Language Models are Complex Table Parsers
Rethinking Tabular Data Understanding with Large Language Models [code]
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
Unified Language Representation for Question Answering over Text, Tables, and Images
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models [code]
TableLlama: Towards Open Large Generalist Models for Tables [code]
StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]
JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization
CABINET: Content Relevance-based Noise Reduction for Table Question Answering [code]
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [code]
Querying Large Language Models with SQL
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]
C3: Zero-shot Text-to-SQL with ChatGPT [code]
DBCopilot: Scaling Natural Language Querying to Massive Databases [code]
Bridging the Gap: Deciphering Tabular Data Using Large Language Model
TableQuery: Querying tabular data with natural language [code]
S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers
Dynamic hybrid relation network for cross-domain context-dependent semantic parsing
STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing
SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers
Towards Generalizable and Robust Text-to-SQL Parsing
Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation [code]
Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning [code]
Table-based Fact Verification with Salience-aware Learning [code]
Cocoon: Semantic Table Profiling Using Large Language Models [code]
Relationalizing Tables with Large Language Models: The Promise and Challenges
Disambiguate Entity Matching using Large Language Models through Relation Discovery [code]
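Many of the text-to-SQL papers listed above share a common prompt pattern: present the database schema together with the natural-language question and ask the model for a SQL query. The sketch below illustrates that pattern; the schema rendering and instruction wording are our own assumptions, not taken from any specific paper:

```python
# Hypothetical schema-plus-question prompt builder for text-to-SQL.
# Real systems (e.g., decomposition- or self-correction-based ones) add
# few-shot examples, foreign keys, and multi-step reasoning on top of this.
def format_schema(tables: dict) -> str:
    """Render {table_name: [column, ...]} as CREATE TABLE-like statements."""
    return "\n".join(
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in tables.items()
    )

def text_to_sql_prompt(tables: dict, question: str) -> str:
    """Assemble the prompt an LLM would complete with a SQL query."""
    return (
        "Given the database schema:\n"
        f"{format_schema(tables)}\n"
        f"Question: {question}\n"
        "Write a single SQL query that answers the question.\nSQL:"
    )

schema = {"employees": ["id", "name", "salary"], "depts": ["id", "dept_name"]}
print(text_to_sql_prompt(schema, "What is the average salary?"))
```

The differences between the methods above largely come down to what surrounds this core prompt: schema linking, in-context example selection, decomposition, and post-hoc SQL correction.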
Please refer to our paper for the methods benchmarked on these datasets.
Dataset | Dataset Number | Dataset Repo |
---|---|---|
OpenML | 11 | https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data |
Kaggle API | 169 | https://github.com/Kaggle/kaggle-api |
Combo | 9 | https://github.com/clinicalml/TabLLM/tree/main/datasets |
UCI ML | 20 | https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance |
DDX | 10 | https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark |
Dataset | # Tables | Task Type | Input | Output | Data Source | Dataset Repo |
---|---|---|---|---|---|---|
FetaQA | 10330 | QA | Table Question | Answer | Wikipedia | https://github.com/Yale-LILY/FeTaQA |
WikiTableQuestion | 2108 | QA | Table Question | Answer | Wikipedia | https://ppasupat.github.io/WikiTableQuestions/ |
NQ-TABLES | 169898 | QA | Question, Table | Answer | Synthetic | https://github.com/google-research-datasets/natural-questions |
HybriDialogue | 13000 | QA | Conversation, Table, Reference | Answer | Wikipedia | https://github.com/entitize/HybridDialogue |
TAT-QA | 2757 | QA | Question, Table | Answer | Financial report | https://github.com/NExTplusplus/TAT-QA |
HiTAB | 3597 | QA/NLG | Question, Table | Answer | Statistical Report and Wikipedia | https://github.com/microsoft/HiTab |
ToTTo | 120000 | NLG | Table | Sentence | Wikipedia | https://github.com/google-research-datasets/ToTTo |
FEVEROUS | 28800 | Classification | Claim, Table | Label | Common Crawl | https://fever.ai/dataset/feverous.html |
Dresden Web Tables | 125M | Classification | Table | Label | Common Crawl | https://ppasupat.github.io/WikiTableQuestions/ |
InfoTabs | 2540 | NLI | Table, Hypothesis | Label | Wikipedia | https://infotabs.github.io/ |
TabFact | 16573 | NLI | Table, Statement | Label | Wikipedia | https://tabfact.github.io/ |
TAPEX | 1500 | Text2SQL | SQL, Table | Answer | Synthetic | https://github.com/google-research/tapas |
Spider | 1020 | Text2SQL | Table, Question | SQL | Human Annotated | https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0 |
WIKISQL | 24241 | Text2SQL | Table, Question | SQL, Answer | Human Annotated | https://github.com/salesforce/WikiSQL |
BIRD | 12751 | Text2SQL | Table, Question | SQL | Human Annotated | https://bird-bench.github.io/ |
Tapilot-Crossing | 5 | Text2Code, QA, RAG | Table, Dialog History, Question, Private Lib, Chart | Python, Private Lib Code, Answer | Human-Agent Interaction | https://tapilot-crossing.github.io/ |
A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions
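Several of the QA benchmarks above (e.g., WikiTableQuestions) score predictions by exact-match "denotation accuracy". The sketch below is a deliberate simplification; the official evaluators apply more elaborate normalization (numbers, dates, sets of answer cells) than this:

```python
# Simplified denotation-accuracy metric for table QA.
# Normalization here is only lowercase + whitespace collapsing,
# which is weaker than official benchmark scorers.
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace for a lenient comparison."""
    return " ".join(answer.lower().split())

def denotation_accuracy(preds: list, golds: list) -> float:
    """Fraction of predictions that exactly match the gold answer."""
    assert len(preds) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

score = denotation_accuracy(["Paris ", "42", "yes"], ["paris", "41", "Yes"])
print(score)  # 2 of 3 match after normalization
```

When comparing numbers across papers, check which normalization rules each evaluator applies, since they can shift scores by several points.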
If you would like to contribute to this list or writeup, feel free to submit a pull request!