LLM-Synthetic-Data

Real-time, fine-grained coverage of LLM synthetic data: data of, by, and for LLMs.

Stars: 101

LLM-Synthetic-Data is a repository that tracks work on LLM synthetic data in real time and at a fine granularity. It collects surveys, methods, and application areas related to synthetic data for language models. The methods covered are pre-training, instruction tuning, model collapse, LLM benchmarking, evaluation, and distillation. The application areas are mathematical reasoning, code generation, text-to-SQL, alignment, reward modeling, long context, weak-to-strong generalization, agent and tool use, vision and language, factuality, federated learning, generative design, and safety.

README:

Real-time, Fine-grained LLM-Synthetic-Data

This repo, inspired by Awesome-LLM-Synthetic-Data, focuses on real-time, fine-grained LLM-Synthetic-Data.

If you find this useful, feel free to follow us and star both repos.

Thanks to all the great contributors on GitHub! 🔥⚡🔥

Contents

1. GitHub Repos:

2. Blogs:

3. Surveys:

  1. Best Practices and Lessons Learned on Synthetic Data for Language Models. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai. COLM 2024.
  2. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang. ACL Findings 2024.
  3. Large Language Models for Data Annotation: A Survey. Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu. EMNLP 2024.
  4. Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. Xu Guo, Yiqiang Chen. arXiv 2024.
  5. Comprehensive Exploration of Synthetic Data Generation: A Survey. André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster. arXiv 2024.

4. Methods:

4.1. Pre-training

4.2. Instruction Tuning

4.3. Model Collapse

4.4. LLM Benchmarking

4.5. Evaluation

4.6. Distillation

5. Application Areas:

5.1. Mathematical Reasoning

5.2. Code Generation

5.3. Text-to-SQL

5.4. Alignment

5.5. Reward Modeling

5.6. Long Context

5.7. Weak-to-Strong

5.8. Agent and Tool Use

5.9. Vision and Language

5.10. Factuality

5.11. Federated Learning

5.12. Generative Design

5.13. Safety

6. Tools

7. Datasets
