
LLM-on-Tabular-Data-Prediction-Table-Understanding-Data-Generation
Repository for collecting and categorizing papers outlined in our survey paper: "Large Language Models on Tabular Data -- A Survey".

This repository serves as a comprehensive survey on the application of Large Language Models (LLMs) on tabular data, focusing on tasks such as prediction, data generation, and table understanding. It aims to consolidate recent progress in this field by summarizing key techniques, metrics, datasets, models, and optimization approaches. The survey identifies strengths, limitations, unexplored territories, and gaps in the existing literature, providing insights for future research directions. It also offers code and dataset references to empower readers with the necessary tools and knowledge to address challenges in this rapidly evolving domain.
README:
@article{fang2024large,
  title={Large Language Models ({LLM}s) on Tabular Data: Prediction, Generation, and Understanding - A Survey},
  author={Xi Fang and Weijie Xu and Fiona Anting Tan and Ziqing Hu and Jiani Zhang and Yanjun Qi and Srinivasan H. Sengamedu and Christos Faloutsos},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=IZnrCGF9WI}
}
This repo collects and categorizes papers about large language models (LLMs) on tabular data according to our survey paper, "Large Language Models on Tabular Data -- A Survey". Given the fast pace of development in this field, we will continue to update both the arXiv paper and this repo.
Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.
Figure 1: Overview of LLMs on tabular data: the paper discusses applications of LLMs to prediction, data generation, and table understanding tasks.
Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.
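As a quick illustration of the serialization step in Figure 4, below is a minimal, hypothetical sketch (not taken from any paper's codebase) of turning a table row into a natural-language prompt in the "The &lt;column&gt; is &lt;value&gt;." style used by TabLLM-like few-shot classification methods; the column names and task wording are illustrative assumptions:

```python
# Sketch of row serialization for LLM-based tabular prediction.
# Each column becomes one sentence; the task instruction is appended.
def serialize_row(row: dict) -> str:
    """Turn one table row into a sentence-per-column text description."""
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def build_prompt(row: dict, task: str) -> str:
    """Combine the serialized row with a task instruction for the LLM."""
    return f"{serialize_row(row)}\n{task}\nAnswer:"

# Hypothetical example row and task (not from a real dataset).
row = {"age": 42, "occupation": "engineer", "income": "70k"}
print(build_prompt(row, "Does this person earn more than 50k? Yes or No."))
```

Variations on this template (key-value pairs, markdown tables, LaTeX) are compared in the serialization papers listed below.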
Table of contents:
TABLET: Learning From Instructions For Tabular Data [code]
Language models are weak learners
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks [code]
TabLLM: Few-shot Classification of Tabular Data with Large Language Models [code]
UniPredict: Large Language Models are Universal Tabular Classifiers
Towards Foundation Models for Learning on Tabular Data
Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks Using Multiple Embedding Model for EHR (MEME) [code]
StructLM: Towards Building Generalist Models for Structured Knowledge Grounding
UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science
Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science [model]
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
LLMs learn governing principles of dynamical systems, revealing an in-context neural scaling law
PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting
Large Language Models Are Zero-Shot Time Series Forecasters
TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [code]
MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [code]
CPLLM: Clinical Prediction with Large Language Models [code]
CTRL: Connect Collaborative and Language Model for CTR Prediction
FinGPT: Open-Source Financial Large Language Models [code]
Language Models are Realistic Tabular Data Generators [code]
REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
Generative Table Pre-training Empowers Models for Tabular Prediction [code]
TabuLa: Harnessing Language Models for Tabular Data Synthesis [code]
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes
TabMT: Generating tabular data with masked transformers
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
Graph-to-Text Generation with Dynamic Structure Pruning
Plan-then-Seam: Towards Efficient Table-to-Text Generation
Differentially Private Tabular Data Synthesis using Large Language Models
Pythia: Unsupervised Generation of Ambiguous Textual Claims from Relational Data
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]
PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]
Large Language Models are few(1)-shot Table Reasoners [code]
cTBLS: Augmenting Large Language Models with Conversational Tables [code]
Large Language Models are Complex Table Parsers
Rethinking Tabular Data Understanding with Large Language Models [code]
TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
Unified Language Representation for Question Answering over Text, Tables, and Images
SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models [code]
TableLlama: Towards Open Large Generalist Models for Tables [code]
StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]
JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization
CABINET: Content Relevance-based Noise Reduction for Table Question Answering [code]
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [code]
Querying Large Language Models with SQL
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]
C3: Zero-shot Text-to-SQL with ChatGPT [code]
DBCopilot: Scaling Natural Language Querying to Massive Databases [code]
Bridging the Gap: Deciphering Tabular Data Using Large Language Model
TableQuery: Querying tabular data with natural language [code]
S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers
Dynamic hybrid relation network for cross-domain context-dependent semantic parsing
STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing
SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers
Towards Generalizable and Robust Text-to-SQL Parsing
Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation [code]
Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning [code]
Table-based Fact Verification with Salience-aware Learning [code]
Cocoon: Semantic Table Profiling Using Large Language Models [code]
Relationalizing Tables with Large Language Models: The Promise and Challenges
Disambiguate Entity Matching using Large Language Models through Relation Discovery [code]
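Many of the text-to-SQL papers listed above share a common prompt pattern: present the database schema together with the natural-language question and ask the model for a SQL query. The sketch below illustrates that pattern; the schema rendering and instruction wording are our own assumptions, not taken from any specific paper:

```python
# Hypothetical schema-plus-question prompt builder for text-to-SQL.
# Real systems (e.g., decomposition- or self-correction-based ones) add
# few-shot examples, foreign keys, and multi-step reasoning on top of this.
def format_schema(tables: dict) -> str:
    """Render {table_name: [column, ...]} as CREATE TABLE-like statements."""
    return "\n".join(
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in tables.items()
    )

def text_to_sql_prompt(tables: dict, question: str) -> str:
    """Assemble the prompt an LLM would complete with a SQL query."""
    return (
        "Given the database schema:\n"
        f"{format_schema(tables)}\n"
        f"Question: {question}\n"
        "Write a single SQL query that answers the question.\nSQL:"
    )

schema = {"employees": ["id", "name", "salary"], "depts": ["id", "dept_name"]}
print(text_to_sql_prompt(schema, "What is the average salary?"))
```

The differences between the methods above largely come down to what surrounds this core prompt: schema linking, in-context example selection, decomposition, and post-hoc SQL correction.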
Please refer to our paper for the methods benchmarked on these datasets.
Dataset | Dataset Number | Dataset Repo |
---|---|---|
OpenML | 11 | https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data |
Kaggle API | 169 | https://github.com/Kaggle/kaggle-api |
Combo | 9 | https://github.com/clinicalml/TabLLM/tree/main/datasets |
UCI ML | 20 | https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance |
DDX | 10 | https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark |
Dataset | # Tables | Task Type | Input | Output | Data Source | Dataset Repo |
---|---|---|---|---|---|---|
FetaQA | 10330 | QA | Table Question | Answer | Wikipedia | https://github.com/Yale-LILY/FeTaQA |
WikiTableQuestion | 2108 | QA | Table Question | Answer | Wikipedia | https://ppasupat.github.io/WikiTableQuestions/ |
NQ-TABLES | 169898 | QA | Question, Table | Answer | Synthetic | https://github.com/google-research-datasets/natural-questions |
HybriDialogue | 13000 | QA | Conversation, Table, Reference | Answer | Wikipedia | https://github.com/entitize/HybridDialogue |
TAT-QA | 2757 | QA | Question, Table | Answer | Financial report | https://github.com/NExTplusplus/TAT-QA |
HiTAB | 3597 | QA/NLG | Question, Table | Answer | Statistical Report and Wikipedia | https://github.com/microsoft/HiTab |
ToTTo | 120000 | NLG | Table | Sentence | Wikipedia | https://github.com/google-research-datasets/ToTTo |
FEVEROUS | 28800 | Classification | Claim, Table | Label | Common Crawl | https://fever.ai/dataset/feverous.html |
Dresden Web Tables | 125M | Classification | Table | Label | Common Crawl | https://ppasupat.github.io/WikiTableQuestions/ |
InfoTabs | 2540 | NLI | Table, Hypothesis | Label | Wikipedia | https://infotabs.github.io/ |
TabFact | 16573 | NLI | Table, Statement | Label | Wikipedia | https://tabfact.github.io/ |
TAPEX | 1500 | Text2SQL | SQL, Table | Answer | Synthetic | https://github.com/google-research/tapas |
Spider | 1020 | Text2SQL | Table, Question | SQL | Human Annotated | https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0 |
WIKISQL | 24241 | Text2SQL | Table, Question | SQL, Answer | Human Annotated | https://github.com/salesforce/WikiSQL |
BIRD | 12751 | Text2SQL | Table, Question | SQL | Human Annotated | https://bird-bench.github.io/ |
Tapilot-Crossing | 5 | Text2Code, QA, RAG | Table, Dialog History, Question, Private Lib, Chart | Python, Private Lib Code, Answer | Human-Agent Interaction | https://tapilot-crossing.github.io/ |
A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions
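Several of the QA benchmarks above (e.g., WikiTableQuestions) score predictions by exact-match "denotation accuracy". The sketch below is a deliberate simplification; the official evaluators apply more elaborate normalization (numbers, dates, sets of answer cells) than this:

```python
# Simplified denotation-accuracy metric for table QA.
# Normalization here is only lowercase + whitespace collapsing,
# which is weaker than official benchmark scorers.
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace for a lenient comparison."""
    return " ".join(answer.lower().split())

def denotation_accuracy(preds: list, golds: list) -> float:
    """Fraction of predictions that exactly match the gold answer."""
    assert len(preds) == len(golds)
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

score = denotation_accuracy(["Paris ", "42", "yes"], ["paris", "41", "Yes"])
print(score)  # 2 of 3 match after normalization
```

When comparing numbers across papers, check which normalization rules each evaluator applies, since they can shift scores by several points.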
If you would like to contribute to this list or writeup, feel free to submit a pull request!