LLM-Synthetic-Data
Real-time, fine-grained LLM synthetic data involves data of, by, and for LLMs.
Stars: 101
LLM-Synthetic-Data is a repository focused on real-time, fine-grained LLM-Synthetic-Data generation. It includes methods, surveys, and application areas related to synthetic data for language models. The repository covers topics like pre-training, instruction tuning, model collapse, LLM benchmarking, evaluation, and distillation. It also explores application areas such as mathematical reasoning, code generation, text-to-SQL, alignment, reward modeling, long context, weak-to-strong generalization, agent and tool use, vision and language, factuality, federated learning, generative design, and safety.
README:
This repo, inspired by Awesome-LLM-Synthetic-Data, focuses on real-time, fine-grained LLM-Synthetic-Data.
If you find this useful, feel free to follow us and star both repos.
Thanks to all the great contributors on GitHub! 🔥⚡🔥
- Synthetic data: save money, time and carbon with open source. Moritz Laurer. Feb 16, 2024.
- Synthetic data generation (Part 1). Dylan Royan Almeida. Apr 10, 2024
- Synthetic dataset generation techniques: Self-Instruct. Daniel van Strien. May 15, 2024.
- CodecLM: Aligning language models with tailored synthetic data. Zifeng Wang and Chen-Yu Lee. May 30, 2024.
- The Rise of Agentic Data Generation. Maxime Labonne. July 15, 2024.
- LLM-Driven Synthetic Data Generation, Curation & Evaluation. Cobus Greyling. Aug 2, 2024.
- Using LLMs for Synthetic Data Generation: The Definitive Guide. Kritin Vongthongsri. November 8, 2024.
- Best Practices and Lessons Learned on Synthetic Data for Language Models. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai. COLM 2024.
- On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang. ACL Findings 2024.
- Large Language Models for Data Annotation: A Survey. Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu. EMNLP 2024.
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. Xu Guo, Yiqiang Chen. Arxiv 2024.
- Comprehensive Exploration of Synthetic Data Generation: A Survey. André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster. Arxiv 2024.
- Phi-4 Technical Report. Microsoft Research. Arxiv 2024.
- Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset. Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro. Arxiv 2024.
- MAmmoTH2: Scaling Instructions from the Web. Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen. NeurIPS 2024.
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. Tencent Hunyuan Team. Arxiv 2024.
- STaR: Bootstrapping Reasoning With Reasoning. Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman. NeurIPS 2022.
- Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han. NeurIPS 2022.
- ZeroGen: Efficient Zero-shot Learning via Dataset Generation. Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, Lingpeng Kong. EMNLP 2022.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models. Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi. NAACL 2022.
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias. Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang. NeurIPS D&B 2023.
- Self-instruct: Aligning language models with self-generated instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. ACL 2023.
- Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions. John Joon Young Chung, Ece Kamar, Saleema Amershi. ACL 2023.
- Large Language Models Can Self-Improve. Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han. EMNLP 2023.
- Making Large Language Models Better Data Creators. Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar. EMNLP 2023.
- Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science. Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West. Arxiv 2023.
- Self-Rewarding Language Models. Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. ICML 2024.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas. Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu. Arxiv 2024.
- Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation. Nihal V. Nayak, Yiyang Nan, Avi Trost, Stephen H. Bach. ACL Findings 2024.
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly. ACL 2024.
- Automatic Instruction Evolving for Large Language Models. Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen. EMNLP 2024.
- Self-playing Adversarial Language Game Enhances LLM Reasoning. Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, Nan Du. NeurIPS 2024.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang. ICLR 2024.
- CodecLM: Aligning Language Models with Tailored Synthetic Data. Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister. NAACL Findings 2024.
- TarGEN: Targeted Data Generation with Large Language Models. Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra. COLM 2024.
- Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models. Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei. Submitted to ICLR 2025.
- Forewarned is Forearmed: Harnessing LLMs for Data Synthesis via Failure-induced Exploration. Submitted to ICLR 2025.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin. Submitted to ICLR 2025.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources. Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. Submitted to ICLR 2025.
- AI models collapse when trained on recursively generated data. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal. Nature 2024.
- How bad is training on synthetic data? A statistical analysis of language model collapse. Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Abdelkader DEBBAH. COLM 2024.
- ToEdit: How to Synthesize Text Data to Avoid Model Collapse? Submitted to ICLR 2025.
- Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement. Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe. Submitted to ICLR 2025.
- DataGen: Unified Synthetic Dataset Generation via Large Language Models. Submitted to ICLR 2025.
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models. Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin. Submitted to ICLR 2025.
- FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models. Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, Haobo Wang. EMNLP 2023.
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models. Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan. EMNLP Findings 2023.
- Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin. EMNLP Findings 2023.
- Distilling LLMs' Decomposition Abilities into Compact Language Models. Denis Tarasov, Kumar Shridhar. AutoRL@ICML 2024.
- MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning. Chengpeng Li, Zheng Yuan, Hongyi Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, Chang Zhou. ACL 2024.
- MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs. Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li. ACL 2024.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu. ICLR 2024.
- Augmenting Math Word Problems via Iterative Question Composing. Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao. DPFM@ICLR 2024.
- Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen. Arxiv 2024.
- Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He. NeurIPS 2024.
- Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton. Arxiv 2021.
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi. NeurIPS 2022.
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao. Arxiv 2023.
- Language Models Can Teach Themselves to Program Better. Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai. ICLR 2023.
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. ICLR 2023.
- Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions. Sahil Chaudhary. GitHub 2023.
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models. Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg. Arxiv 2024.
- Magicoder: Empowering Code Generation with OSS-Instruct. Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang. ICML 2024.
- WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning. Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin. ACL 2024.
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang. ICLR 2024.
- Learning Performance-Improving Code Edits. Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh. ICLR 2024.
- InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct. Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen. Arxiv 2024.
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue. Arxiv 2024.
- AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct. Bin Lei, Yuchen Li, Qiuwu Chen. Arxiv 2024.
- How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data. Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu. Arxiv 2024.
- SelfCodeAlign: Self-Alignment for Code Generation. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang. Arxiv 2024.
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs. Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou. ACL 2024.
- Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts. Yev Meyer, Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Segbroeck, Matthew Grossman, Piotr Mlocek, Drew Newberry. Huggingface 2024.
- Constitutional AI: Harmlessness from AI Feedback Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. Arxiv 2022.
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. NeurIPS 2023.
- SALMON: Self-Alignment with Instructable Reward Models. Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. ICLR 2024.
- Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs. Víctor Gallego. Arxiv 2024.
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models. Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou. Submitted to ICLR 2025.
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts. Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu. NeurIPS 2024.
- Self-Alignment with Instruction Backtranslation. Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, Mike Lewis. ICLR 2024.
- West-of-N: Synthetic Preference Generation for Improved Reward Modeling. Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. Arxiv 2024.
- Make Your LLM Fully Utilize the Context. Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou. Arxiv 2024.
- From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. Submitted to ICLR 2025.
- Scaling Instruction-tuned LLMs to Million-token Contexts via Hierarchical Synthetic Data Generation. Submitted to ICLR 2025.
- Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeffrey Wu. ICML 2024.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu. ICML 2024.
- Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Models. Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi. NAACL 2024.
- ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, Le Sun. Arxiv 2023.
- Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. NeurIPS 2023.
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan. NeurIPS 2023.
- Gorilla: Large Language Model Connected with Massive APIs. Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez. NeurIPS 2024.
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs. Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach. EMNLP 2024.
- Voyager: An Open-Ended Embodied Agent with Large Language Models. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. TMLR 2024.
- Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. NeurIPS 2023.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou. Arxiv 2023.
- G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. Submitted to ICLR 2025.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. ICLR 2024.
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension. Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang. NeurIPS 2024.
- LLaVA-OneVision: Easy Visual Task Transfer. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li. Submitted to TMLR.
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. Liyan Tang, Philippe Laban, Greg Durrett. EMNLP 2024.
- Fine-tuning Language Models for Factuality. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. ICLR 2024.
- Prompt Public Large Language Models to Synthesize Data for Private On-device Applications. Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage. COLM 2024.
- Harnessing large-language models to generate private synthetic text. Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis. Arxiv 2024.
- Generative Design through Quality-Diversity Data Synthesis and Language Models. Adam Gaier, James Stoddart, Lorenzo Villaggi, Shyam Sudhakaran. GECCO 2024.
- SynthPAI: A Synthetic Dataset for Personal Attribute Inference. Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev. NeurIPS D&B 2024.
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. Ajay Patel, Colin Raffel, Chris Callison-Burch. ACL 2024.
- AgentInstruct: Toward Generative Teaching with Agentic Flows. Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah. Arxiv 2024.
- Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs. Álvaro Bartolomé Del Canto, Gabriel Martín Blázquez, Agustín Piqueres Lajarín and Daniel Vila Suero. GitHub 2024.
- Fuxion: Synthetic Data Generation and Normalization Functions using Langchain + LLMs.
- Open Artificial Knowledge. Vadim Borisov, Richard Schreiber. ICML Workshop 2024.
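
Most of the methods and tools listed above share the same basic loop: prompt a strong LLM with a task description and a few seed examples, parse the generated instruction-response pairs, filter them for quality and diversity, and write the survivors out as training data. The sketch below is a minimal, Self-Instruct-style illustration of that loop, not a reference implementation of any specific paper or tool above; it assumes the `openai` Python client, an `OPENAI_API_KEY` in the environment, and a placeholder model name (`gpt-4o-mini`).

```python
# Minimal Self-Instruct-style synthetic data loop (illustrative sketch only).
# Assumptions: the `openai` package is installed, OPENAI_API_KEY is set, and
# "gpt-4o-mini" is a placeholder model name -- swap in whatever endpoint you use.
import json
import random

from openai import OpenAI

client = OpenAI()

SEED_TASKS = [
    {"instruction": "Summarize the following paragraph in one sentence.",
     "response": "A one-sentence summary of the paragraph."},
    {"instruction": "Write a SQL query that counts users who signed up in 2024.",
     "response": "SELECT COUNT(*) FROM users WHERE signup_year = 2024;"},
]

PROMPT_TEMPLATE = (
    "You are generating training data for a language model.\n"
    "Here are example tasks:\n{examples}\n"
    "Write {n} new, diverse tasks in the same JSON format "
    '(a list of objects with "instruction" and "response" fields). '
    "Return only the JSON list."
)


def generate_batch(seeds, n=5, model="gpt-4o-mini"):
    """Ask the teacher model for n new instruction-response pairs."""
    examples = "\n".join(json.dumps(s) for s in random.sample(seeds, k=min(2, len(seeds))))
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(examples=examples, n=n)}],
        temperature=1.0,
    )
    try:
        pairs = json.loads(completion.choices[0].message.content or "[]")
    except json.JSONDecodeError:
        return []  # discard unparseable generations
    return pairs if isinstance(pairs, list) else []


def keep(sample, pool):
    """Toy quality/diversity filter: well-formed fields, no exact duplicates."""
    if not isinstance(sample, dict):
        return False
    if not sample.get("instruction") or not sample.get("response"):
        return False
    return sample["instruction"] not in {p["instruction"] for p in pool}


if __name__ == "__main__":
    pool = list(SEED_TASKS)
    for _ in range(3):  # a few rounds; real pipelines run many more
        for sample in generate_batch(pool):
            if keep(sample, pool):
                pool.append(sample)
    with open("synthetic_data.jsonl", "w") as f:
        for sample in pool:
            f.write(json.dumps(sample) + "\n")
```

Real pipelines such as Self-Instruct, Evol-Instruct, Magpie, DataDreamer, or Distilabel layer much stronger deduplication, verification, and filtering on top of this basic loop.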