Awesome-LLM-Tabular

Awesome-LLM-Tabular: a curated list of Large Language Models applied to Tabular Data

Stars: 335

This repository is a curated list of research papers that explore the integration of Large Language Model (LLM) technology with tabular data. It aims to provide a comprehensive resource for researchers and practitioners interested in this emerging field. The repository includes papers on a wide range of topics, including table-to-text generation, table question answering, and tabular data classification. It also includes a section on related datasets and resources.

README:

Awesome-LLM-Tabular

💡 Since the emergence of ChatGPT, Large Language Models (LLMs) have garnered significant attention, with new advancements continuously emerging. LLMs have found applications across vision, audio, and text tasks. Tabular data, however, remains one of the most important data formats in practice. Hence, this repo collects research papers that explore the integration of LLM technology with tabular data, and aims to save you valuable time and boost research efficiency.

✨ Awesome-LLM-Tabular is a curated list of Large Language Models applied to Tabular Data.

🔥 This project is currently under development. Feel free to ⭐ (STAR) and 🔭 (WATCH) it to stay updated on the latest developments.

Table of Contents

Awesome-LLM-Tabular

Related Papers

Date	Keywords	Paper
2019/09	TabFact	TabFact: A Large-scale Dataset for Table-based Fact Verification
2020	TableGPT	TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching
2020/05	TaBERT	TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
2020/09	GraPPa	GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
2022/02	TableQuery	TableQuery: Querying tabular data with natural language
2022/05	FeSTE	Few-Shot Tabular Data Enrichment Using Fine-Tuned Transformer Architectures
2022/05	FM	Can Foundation Models Wrangle Your Data?
2022/05	TURL	Technical Perspective of TURL: Table Understanding through Representation Learning
2022/06	TabText	TabText: A Flexible and Contextual Approach to Tabular Data Representation
2022/06	LIFT	LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
2022/09	PTab	PTab: Using the Pre-trained Language Model for Modeling Tabular Data
2022/09	TabMWP	Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
2022/10	GReaT	Language Models are Realistic Tabular Data Generators
2022/10	TabLLM	TabLLM: Few-shot Classification of Tabular Data with Large Language Models
2023/??	IngesTables	IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
2023/??	Elephants	Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
2023/01	DATER	Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning
2023/02	AdaPTGen	Adapting Prompt for Few-shot Table-to-Text Generation
2023/03	Survey Paper	Transformers for Tabular Data Representation: A Survey of Models and Applications
2023/04	TABLET	TABLET: Learning From Instructions For Tabular Data
2023/05	AnyPredict	AnyPredict: Foundation Model for Tabular Prediction
2023/05	TAPTAP	Generative Table Pre-training Empowers Models for Tabular Prediction
2023/05	Bird	Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs
2023/07	TableGPT	TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT
2023/07	UniTabE	UniTabE: A Universal Pretraining Protocol for Tabular Foundation Model in Data Science
2023/10	TabFMs	Towards Foundation Models for Learning on Tabular Data
2023/10	TableFormat	Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
2023/10	UniPredict	UniPredict: Large Language Models are Universal Tabular Classifiers
2023/10	Table-GPT	Table-GPT: Table-tuned GPT for Diverse Table Tasks
2023/11	NumericalReasoning	Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data
2023/12	TaCo	Chain-of-Thought Reasoning in Tabular Language Models
2023/12	TableLLM	Rethinking Tabular Data Understanding with Large Language Models
2024/01	Chain-of-Table	Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
2024/01	TAT-LLM	TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data
2024/01	DataFrame QA	DataFrame QA: A Universal LLM Framework on DataFrame Question Answering Without Data Exposure
2024/02	Survey Paper	LLM on Tabular Data: Prediction, Generation, and Understanding
2024/02	Survey Paper	Large Language Model for Table Processing: A Survey
2024/02	CABINET	CABINET: Content Relevance based Noise Reduction for Table Question Answering
2024/02	OpenTab	OpenTab: Advancing Large Language Models as Open-domain Table Reasoners
2024/02	CancerGPT	CancerGPT for few shot drug pair synergy prediction using large pretrained language models
2024/02	Exploration of LLM on Tabular	Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data
2024/03	TableLLM	Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study
2024/03	ITAB-LLM	Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science
2024/03	TP-BERTa	Making Pre-trained Language Models Great on Tabular Prediction
2024/04	FeatLLM	Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning
2024/04	TabSQLify	TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition
2024/04	LLMClean	LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs
2024/04	Elephants Never Forget	Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
2024/05	CARTE	CARTE: Pretraining and Transfer for Tabular Learning
2024/05	DataBench	Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs
2024/05	Position	Position: Why Tabular Foundation Models Should Be a Research Priority
2024/06	Struc-Bench	Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?
2024/07	folktexts	Evaluating language models as risk scores
2024/07	SpreadsheetLLM	SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
2024/08	rLLM	rLLM: Relational Table Learning with LLMs

Workshops

Useful Blogs

A Short Chronology Of Deep Learning For Tabular Data by Sebastian Raschka

Citation

@misc{wu2024awesomellmtabular,
  author = {Wu, Hong-Wei},
  title = {Awesome-LLM-Tabular},
  year = {2024},
  note = {Accessed: 2024-05-30},
  url = {https://github.com/johnnyhwu/Awesome-LLM-Tabular},
  orcid = {https://orcid.org/0009-0005-8073-5297}
}

Contributing

We welcome contributions to keep this repository up to date with the latest research and applications of LLMs in the tabular domain. Whether you want to correct mistakes, add new content, or suggest improvements, your contributions are highly appreciated 🤗.

For Tasks:

generate text from tables, answer questions about tables, classify tabular data

For Jobs:

data scientist, machine learning engineer, research scientist, data analyst, business analyst

Alternative AI tools for Awesome-LLM-Tabular

Similar Open Source Tools

LLamaTuner

LLamaTuner is a repository for the Efficient Finetuning of Quantized LLMs project, focusing on building and sharing instruction-following Chinese baichuan-7b/LLaMA/Pythia/GLM model tuning methods. The project enables training on a single Nvidia RTX-2080TI and RTX-3090 for multi-round chatbot training. It utilizes bitsandbytes for quantization and is integrated with Huggingface's PEFT and transformers libraries. The repository supports various models, training approaches, and datasets for supervised fine-tuning, LoRA, QLoRA, and more. It also provides tools for data preprocessing and offers models in the Hugging Face model hub for inference and finetuning. The project is licensed under Apache 2.0 and acknowledges contributions from various open-source contributors.

GitHub stars: 586

TRACE

TRACE is a temporal grounding video model that utilizes causal event modeling to capture videos' inherent structure. It presents a task-interleaved video LLM model tailored for sequential encoding/decoding of timestamps, salient scores, and textual captions. The project includes various model checkpoints for different stages and fine-tuning on specific datasets. It provides evaluation code for different tasks like VTG, MVBench, and VideoMME. The repository also offers annotation files and links to raw video preparation projects. Users can train the model on different tasks and evaluate the performance based on metrics like CIDEr, METEOR, SODA_c, F1, mAP, Hit@1, etc. TRACE has been enhanced with trace-retrieval and trace-uni models, showing improved performance on dense video captioning and general video understanding tasks.

GitHub stars: 54

nx

Nx is a build system optimized for monorepos, featuring AI-powered architectural awareness and advanced CI capabilities. It provides faster task scheduling, caching, and more for existing workspaces. Nx Cloud enhances CI by offering remote caching, task distribution, automated e2e test splitting, and task flakiness detection. The tool aims to scale monorepos efficiently and improve developer productivity.

GitHub stars: 25.2k

nntrainer

NNtrainer is a software framework for training neural network models on devices with limited resources. It enables on-device fine-tuning of neural networks using user data for personalization. NNtrainer supports various machine learning algorithms and provides examples for tasks such as few-shot learning, ResNet, VGG, and product rating. It is optimized for embedded devices and utilizes CBLAS and CUBLAS for accelerated calculations. NNtrainer is open source and released under the Apache License version 2.0.

GitHub stars: 135

llm-book

The 'llm-book' repository is dedicated to the introduction of large-scale language models, focusing on natural language processing tasks. The code is designed to run on Google Colaboratory and utilizes datasets and models available on the Hugging Face Hub. Note that as of July 28, 2023, there are issues with the MARC-ja dataset links, but an alternative notebook using the WRIME Japanese sentiment analysis dataset has been added. The repository covers various chapters on topics such as Transformers, fine-tuning language models, entity recognition, summarization, document embedding, question answering, and more.

GitHub stars: 291

awesome-llm-webapps

This repository is a curated list of open-source, actively maintained web applications that leverage large language models (LLMs) for various use cases, including chatbots, natural language interfaces, assistants, and question answering systems. The projects are evaluated based on key criteria such as licensing, maintenance status, complexity, and features, to help users select the most suitable starting point for their LLM-based applications. The repository welcomes contributions and encourages users to submit projects that meet the criteria or suggest improvements to the existing list.

GitHub stars: 173

data-prep-kit

Data Prep Kit accelerates unstructured data preparation for LLM app developers. It allows developers to cleanse, transform, and enrich unstructured data for pre-training, fine-tuning, instruct-tuning LLMs, or building RAG applications. The kit provides modules for Python, Ray, and Spark runtimes, supporting Natural Language and Code data modalities. It offers a framework for custom transforms and uses Kubeflow Pipelines for workflow automation. Users can install the kit via PyPi and access a variety of transforms for data processing pipelines.

GitHub stars: 600

visionOS-examples

visionOS-examples is a repository containing accelerators for Spatial Computing. It includes examples such as Local Large Language Model, Chat Apple Vision Pro, WebSockets, Anchor To Head, Hand Tracking, Battery Life, Countdown, Plane Detection, Timer Vision, and PencilKit for visionOS. The repository showcases various functionalities and features for Apple Vision Pro, offering tools for developers to enhance their visionOS apps with capabilities like hand tracking, plane detection, and real-time cryptocurrency prices.

GitHub stars: 223

cs-books

GitHub stars: 52

dora

Dataflow-oriented robotic application (dora-rs) is a framework that makes creating robotic applications fast and simple. Building a robotic application can be summed up as bringing together hardware, algorithms, and AI models, and making them communicate with each other. At dora-rs, we try to make integration of hardware and software easy by supporting Python, C, C++, and ROS2, and to make communication low-latency by using zero-copy Arrow messages. dora-rs is still experimental and you might experience bugs, but we're working very hard to make it as stable as possible.

GitHub stars: 2.0k

fastapi

智元 Fast API is a one-stop API management system that unifies various LLM APIs in terms of format, standards, and management, achieving the ultimate in functionality, performance, and user experience. It supports various models from companies like OpenAI, Azure, Baidu, Keda Xunfei, Alibaba Cloud, Zhifu AI, Google, DeepSeek, 360 Brain, and Midjourney. The project provides user and admin portals for preview, supports cluster deployment, multi-site deployment, and cross-zone deployment. It also offers Docker deployment, a public API site for registration, and screenshots of the admin and user portals. The API interface is similar to OpenAI's interface, and the project is open source with repositories for API, web, admin, and SDK on GitHub and Gitee.

GitHub stars: 245

Awesome-LLM-for-RecSys

GitHub stars: 1.2k

kangaroo

Kangaroo is an AI-powered SQL client and admin tool for popular databases like SQLite, MySQL, PostgreSQL, etc. It supports various functionalities such as table design, query, model, sync, export/import, and more. The tool is designed to be comfortable, fun, and developer-friendly, with features like code intellisense and autocomplete. Kangaroo aims to provide a seamless experience for database management across different operating systems.

GitHub stars: 370

Awesome-Tabular-LLMs

This repository is a collection of papers on Tabular Large Language Models (LLMs) specialized for processing tabular data. It includes surveys, models, and applications related to table understanding tasks such as Table Question Answering, Table-to-Text, Text-to-SQL, and more. The repository categorizes the papers based on key ideas and provides insights into the advancements in using LLMs for processing diverse tables and fulfilling various tabular tasks based on natural language instructions.

GitHub stars: 151

Awesome-LLM-Resources-List

Awesome LLM Resources is a curated collection of resources for Large Language Models (LLMs) covering various aspects such as serverless hosting, accessing off-the-shelf models via API, local inference, LLM serving frameworks, open-source LLM web chat UIs, renting GPUs for fine-tuning, fine-tuning with no-code UI, fine-tuning frameworks, OS agentic/AI workflow, AI agents, co-pilots, voice API, open-source TTS models, OS RAG frameworks, research papers on chain-of-thought prompting, CoT implementations, CoT fine-tuned models & datasets, and more.

GitHub stars: 126

For similar jobs

Azure-Analytics-and-AI-Engagement

The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (containing a demo web application, Power BI reports, Synapse resources, AML Notebooks, etc.) that can be deployed in a customer’s subscription using the CAPE tool within a few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

GitHub stars: 136

skyvern

Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automation required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions that would break whenever the website layout changed. Instead of relying only on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real time, create a plan for interaction, and interact with them. This approach gives us a few advantages:

1. Skyvern can operate on websites it’s never seen before, as it’s able to map visual elements to the actions necessary to complete a workflow, without any customized code.
2. Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate.
3. Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples include:
   1. If you wanted to get an auto insurance quote from Geico, the answer to a common question “Were you eligible to drive at 18?” could be inferred from the driver receiving their license at age 16.
   2. If you were doing competitor analysis, it’s understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!).

GitHub stars: 12.9k

pandas-ai

PandasAI is a Python library that makes it easy to ask questions to your data in natural language. It helps you to explore, clean, and analyze your data using generative AI.

GitHub stars: 14.0k

vanna

Vanna is an open-source Python framework for SQL generation and related functionality. It uses Retrieval-Augmented Generation (RAG) to train a model on your data, which can then be used to ask questions and get back SQL queries. Vanna is designed to be portable across different LLMs and vector databases, and it supports any SQL database. It is also secure and private, as your database contents are never sent to the LLM or the vector database.

GitHub stars: 10.8k

databend

Databend is an open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake. With its focus on fast query execution and data ingestion, it's designed for complex analysis of the world's largest datasets.

GitHub stars: 7.7k

Avalonia-Assistant

Avalonia-Assistant is an open-source desktop intelligent assistant that aims to provide a user-friendly interactive experience based on the Avalonia UI framework and the integration of Semantic Kernel with OpenAI or other large LLM models. By utilizing Avalonia-Assistant, you can perform various desktop operations through text or voice commands, enhancing your productivity and daily office experience.

GitHub stars: 113

marvin

Marvin is a lightweight AI toolkit for building natural language interfaces that are reliable, scalable, and easy to trust. Each of Marvin's tools is simple and self-documenting, using AI to solve common but complex challenges like entity extraction, classification, and generating synthetic data. Each tool is independent and incrementally adoptable, so you can use them on their own or in combination with any other library. Marvin is also multi-modal, supporting both image and audio generation as well as using images as inputs for extraction and classification. Marvin is for developers who care more about _using_ AI than _building_ AI, and we are focused on creating an exceptional developer experience. Marvin users should feel empowered to bring tightly-scoped "AI magic" into any traditional software project with just a few extra lines of code. Marvin aims to merge the best practices for building dependable, observable software with the best practices for building with generative AI into a single, easy-to-use library. It's a serious tool, but we hope you have fun with it. Marvin is open-source, free to use, and made with 💙 by the team at Prefect.

GitHub stars: 5.5k

activepieces

Activepieces is an open source replacement for Zapier, designed to be extensible through a type-safe pieces framework written in TypeScript. It features a user-friendly Workflow Builder with support for Branches, Loops, and Drag and Drop. Activepieces integrates with Google Sheets, OpenAI, Discord, and RSS, along with 80+ other integrations. The list of supported integrations continues to grow rapidly, thanks to valuable contributions from the community. Activepieces is an open ecosystem; all piece source code is available in the repository, and pieces are versioned and published directly to npmjs.com upon contribution. If you cannot find a specific piece on the pieces roadmap, you can submit a request (Request Piece). Alternatively, if you are a developer, you can quickly build your own piece using our TypeScript framework; for guidance, refer to the Contributor's Guide.

GitHub stars: 12.6k