LLM-Synthetic-Data
Real-time, fine-grained LLM synthetic data involves data of, by, and for LLMs.
Stars: 101
LLM-Synthetic-Data is a repository focused on real-time, fine-grained LLM-Synthetic-Data generation. It includes methods, surveys, and application areas related to synthetic data for language models. The repository covers topics like pre-training, instruction tuning, model collapse, LLM benchmarking, evaluation, and distillation. It also explores application areas such as mathematical reasoning, code generation, text-to-SQL, alignment, reward modeling, long context, weak-to-strong generalization, agent and tool use, vision and language, factuality, federated learning, generative design, and safety.
README:
This repo, inspired by Awesome-LLM-Synthetic-Data, focuses on real-time, fine-grained LLM-Synthetic-Data.
If you find this useful, feel free to follow us and star both repos.
Thanks to all the great contributors on GitHub! 🔥⚡🔥
- Synthetic data: save money, time and carbon with open source. Moritz Laurer. Feb 16, 2024.
- Synthetic data generation (Part 1). Dylan Royan Almeida. Apr 10, 2024
- Synthetic dataset generation techniques: Self-Instruct. Daniel van Strien. May 15, 2024.
- CodecLM: Aligning language models with tailored synthetic data. Zifeng Wang and Chen-Yu Lee. May 30, 2024.
- The Rise of Agentic Data Generation. Maxime Labonne. July 15, 2024.
- LLM-Driven Synthetic Data Generation, Curation & Evaluation. Cobus Greyling. Aug 2, 2024.
- Using LLMs for Synthetic Data Generation: The Definitive Guide. Kritin Vongthongsri. November 8, 2024.
- Best Practices and Lessons Learned on Synthetic Data for Language Models. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai. COLM 2024.
- On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang. ACL Findings 2024.
- Large Language Models for Data Annotation: A Survey. Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu. EMNLP 2024.
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. Xu Guo, Yiqiang Chen. Arxiv 2024.
- Comprehensive Exploration of Synthetic Data Generation: A Survey. André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster. Arxiv 2024.
- Phi-4 Technical Report. Microsoft Research. Arxiv 2024.
- Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset. Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro. Arxiv 2024.
- MAmmoTH2: Scaling Instructions from the Web. Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen. NeurIPS 2024.
- Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent. Tencent Hunyuan Team. Arxiv 2024.
- STaR: Bootstrapping Reasoning With Reasoning. Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman. NeurIPS 2022.
- Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han. NeurIPS 2022.
- ZeroGen: Efficient Zero-shot Learning via Dataset Generation. Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, Lingpeng Kong. EMNLP 2022.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models. Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi. NAACL 2022.
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias. Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang. NeurIPS D&B 2023.
- Self-instruct: Aligning language models with self-generated instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. ACL 2023.
- Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions. John Joon Young Chung, Ece Kamar, Saleema Amershi. ACL 2023.
- Large Language Models Can Self-Improve. Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han. EMNLP 2023.
- Making Large Language Models Better Data Creators. Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar. EMNLP 2023.
- Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science. Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West. Arxiv 2023.
- Self-Rewarding Language Models. Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. ICML 2024.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas. Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu. Arxiv 2024.
- Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation. Nihal V. Nayak, Yiyang Nan, Avi Trost, Stephen H. Bach. ACL Findings 2024.
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly. ACL 2024.
- Automatic Instruction Evolving for Large Language Models. Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen. EMNLP 2024.
- Self-playing Adversarial Language Game Enhances LLM Reasoning. Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, Nan Du. NeurIPS 2024.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang. ICLR 2024.
- CodecLM: Aligning Language Models with Tailored Synthetic Data. Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister. NAACL Findings 2024.
- TarGEN: Targeted Data Generation with Large Language Models. Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra. COLM 2024.
- Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models. Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei. Submitted to ICLR 2025.
- Forewarned is Forearmed: Harnessing LLMs for Data Synthesis via Failure-induced Exploration. Submitted to ICLR 2025.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin. Submitted to ICLR 2025.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources. Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. Submitted to ICLR 2025.
- AI models collapse when trained on recursively generated data. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, Yarin Gal. Nature 2024.
- How bad is training on synthetic data? A statistical analysis of language model collapse. Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Abdelkader DEBBAH. COLM 2024.
- ToEdit: How to Synthesize Text Data to Avoid Model Collapse? Submitted to ICLR 2025.
- Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement. Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe. Submitted to ICLR 2025.
- DataGen: Unified Synthetic Dataset Generation via Large Language Models. Submitted to ICLR 2025.
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models. Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin. Submitted to ICLR 2025.
- FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models. Ruixuan Xiao, Yiwen Dong, Junbo Zhao, Runze Wu, Minmin Lin, Gang Chen, Haobo Wang. EMNLP 2023.
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models. Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan. EMNLP Findings 2023.
- Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ming Yin. EMNLP Findings 2023.
- Distilling LLMs' Decomposition Abilities into Compact Language Models. Denis Tarasov, Kumar Shridhar. AutoRL@ICML 2024.
- MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning. Chengpeng Li, Zheng Yuan, Hongyi Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, Chang Zhou. ACL 2024.
- MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs. Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li. ACL 2024.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu. ICLR 2024.
- Augmenting Math Word Problems via Iterative Question Composing. Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao. DPFM@ICLR 2024.
- Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen. Arxiv 2024.
- Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He. NeurIPS 2024.
- Program Synthesis with Large Language Models. Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton. Arxiv 2021.
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, Steven C.H. Hoi. NeurIPS 2022.
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao. Arxiv 2023.
- Language Models Can Teach Themselves to Program Better. Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai. ICLR 2023.
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. ICLR 2023.
- Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions. Sahil Chaudhary. GitHub 2023.
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models. Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg. Arxiv 2024.
- Magicoder: Empowering Code Generation with OSS-Instruct. Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang. ICML 2024.
- WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning. Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin. ACL 2024.
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang. ICLR 2024.
- Learning Performance-Improving Code Edits. Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh. ICLR 2024.
- InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct. Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen. Arxiv 2024.
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement. Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue. Arxiv 2024.
- AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct. Bin Lei, Yuchen Li, Qiuwu Chen. Arxiv 2024.
- How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data. Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu. Arxiv 2024.
- SelfCodeAlign: Self-Alignment for Code Generation. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang. Arxiv 2024.
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs. Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou. ACL 2024.
- Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts. Yev Meyer, Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Segbroeck, Matthew Grossman, Piotr Mlocek, Drew Newberry. Huggingface 2024.
- Constitutional AI: Harmlessness from AI Feedback Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. Arxiv 2022.
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. NeurIPS 2023.
- SALMON: Self-Alignment with Instructable Reward Models. Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. ICLR 2024.
- Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs. Víctor Gallego. Arxiv 2024.
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models. Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou. Submitted to ICLR 2025.
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts. Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu. NeurIPS 2024.
- Self-Alignment with Instruction Backtranslation. Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason Weston, Mike Lewis. ICLR 2024.
- West-of-N: Synthetic Preference Generation for Improved Reward Modeling. Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. Arxiv 2024.
- Make Your LLM Fully Utilize the Context. Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou. Arxiv 2024.
- From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data. Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. Submitted to ICLR 2025.
- Scaling Instruction-tuned LLMs to Million-token Contexts via Hierarchical Synthetic Data Generation. Submitted to ICLR 2025.
- Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeffrey Wu. ICML 2024.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu. ICML 2024.
- Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Models. Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi. NAACL 2024.
- ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, Le Sun. Arxiv 2023.
- Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. NeurIPS 2023.
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan. NeurIPS 2023.
- Gorilla: Large Language Model Connected with Massive APIs. Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez. NeurIPS 2024.
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs. Shadi Iskander, Nachshon Cohen, Zohar Karnin, Ori Shapira, Sofia Tolmach. EMNLP 2024.
- Voyager: An Open-Ended Embodied Agent with Large Language Models. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. TMLR 2024.
- Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. NeurIPS 2023.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou. Arxiv 2023.
- G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. Submitted to ICLR 2025.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. ICLR 2024.
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension. Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang. NeurIPS 2024.
- LLaVA-OneVision: Easy Visual Task Transfer. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li. Submitted to TMLR.
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. Liyan Tang, Philippe Laban, Greg Durrett. EMNLP 2024.
- Fine-tuning Language Models for Factuality. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. ICLR 2024.
- Prompt Public Large Language Models to Synthesize Data for Private On-device Applications. Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage. COLM 2024.
- Harnessing large-language models to generate private synthetic text. Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis. Arxiv 2024.
- Generative Design through Quality-Diversity Data Synthesis and Language Models. Adam Gaier, James Stoddart, Lorenzo Villaggi, Shyam Sudhakaran. GECCO 2024.
- SynthPAI: A Synthetic Dataset for Personal Attribute Inference. Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev. NeurIPS D&B 2024.
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. Ajay Patel, Colin Raffel, Chris Callison-Burch. ACL 2024.
- AgentInstruct: Toward Generative Teaching with Agentic Flows. Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah. Arxiv 2024.
- Distilabel: An AI Feedback (AIF) Framework for Building Datasets with and for LLMs. Álvaro Bartolomé Del Canto, Gabriel Martín Blázquez, Agustín Piqueres Lajarín and Daniel Vila Suero. GitHub 2024.
- Fuxion: Synthetic Data Generation and Normalization Functions using Langchain + LLMs.
- Open Artificial Knowledge. Vadim Borisov, Richard Schreiber. ICML Workshop 2024.
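
Most of the methods and tools listed above share the same basic loop: prompt a strong LLM with a task description and a few seed examples, parse the generated instruction-response pairs, filter them for quality and diversity, and write the survivors out as training data. The sketch below is a minimal, Self-Instruct-style illustration of that loop, not a reference implementation of any specific paper or tool above; it assumes the `openai` Python client, an `OPENAI_API_KEY` in the environment, and a placeholder model name (`gpt-4o-mini`).

```python
# Minimal Self-Instruct-style synthetic data loop (illustrative sketch only).
# Assumptions: the `openai` package is installed, OPENAI_API_KEY is set, and
# "gpt-4o-mini" is a placeholder model name -- swap in whatever endpoint you use.
import json
import random

from openai import OpenAI

client = OpenAI()

SEED_TASKS = [
    {"instruction": "Summarize the following paragraph in one sentence.",
     "response": "A one-sentence summary of the paragraph."},
    {"instruction": "Write a SQL query that counts users who signed up in 2024.",
     "response": "SELECT COUNT(*) FROM users WHERE signup_year = 2024;"},
]

PROMPT_TEMPLATE = (
    "You are generating training data for a language model.\n"
    "Here are example tasks:\n{examples}\n"
    "Write {n} new, diverse tasks in the same JSON format "
    '(a list of objects with "instruction" and "response" fields). '
    "Return only the JSON list."
)


def generate_batch(seeds, n=5, model="gpt-4o-mini"):
    """Ask the teacher model for n new instruction-response pairs."""
    examples = "\n".join(json.dumps(s) for s in random.sample(seeds, k=min(2, len(seeds))))
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(examples=examples, n=n)}],
        temperature=1.0,
    )
    try:
        pairs = json.loads(completion.choices[0].message.content or "[]")
    except json.JSONDecodeError:
        return []  # discard unparseable generations
    return pairs if isinstance(pairs, list) else []


def keep(sample, pool):
    """Toy quality/diversity filter: well-formed fields, no exact duplicates."""
    if not isinstance(sample, dict):
        return False
    if not sample.get("instruction") or not sample.get("response"):
        return False
    return sample["instruction"] not in {p["instruction"] for p in pool}


if __name__ == "__main__":
    pool = list(SEED_TASKS)
    for _ in range(3):  # a few rounds; real pipelines run many more
        for sample in generate_batch(pool):
            if keep(sample, pool):
                pool.append(sample)
    with open("synthetic_data.jsonl", "w") as f:
        for sample in pool:
            f.write(json.dumps(sample) + "\n")
```

Real pipelines such as Self-Instruct, Evol-Instruct, Magpie, DataDreamer, or Distilabel layer much stronger deduplication, verification, and filtering on top of this basic loop.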