LLMs

专注于中文领域大语言模型，落地到某个行业某个领域，成为一个行业大模型、公司级别或行业级别领域大模型。

Stars: 97

Visit

LLMs is a Chinese large language model technology stack for practical use. It includes high-availability pre-training, SFT, and DPO preference alignment code framework. The repository covers pre-training data cleaning, high-concurrency framework, SFT dataset cleaning, data quality improvement, and security alignment work for Chinese large language models. It also provides open-source SFT dataset construction, pre-training from scratch, and various tools and frameworks for data cleaning, quality optimization, and task alignment.

README:

中文领域大语言模型实操技术全栈

Scaling Law for SFT & DPO

coming soon...

领域LLMs的数据模型 for SFT & DPO

分享一个观点："Without a well thought-out data model, organizations will be constantly fighting short-term data engineering problems and connecting the business needs with data requirements will be extremely difficult. "如果没有一个经过深思熟虑的数据模型，组织将不断地与短期数据工程问题作斗争，将业务需求与数据需求联系起来将是极其困难的。"
来源网页需要付费账户，不再注明。
数据生产方案，是从原始数据到可训练数据的一整套流程，包括原始数据筛选（data selection）、数据清洗、数据embedding、数据质量打分、数据难度打分、数据Tag等等，利用这些特征将数据集分层，之后筛选进入SFT训练的数据集，偏好数据的种子从中选择，经过几十个模型生产数据，构造偏好关系，得到用于偏好对齐的训练数据。

领域LLMs的有效训练方案 for SFT & DPO

训练方案将SFT和DPO串联起来，一轮SFT后紧接着一轮DPO，往复迭代三轮，每一轮都增加新数据进来，构造更有难度的偏好数据。

预训练、SFT和DPO偏好对齐代码框架

已经在36台A100，288块GPU卡上实际跑过预训练、增量预训练和SFT，涉及1.5T的tokens训练，等待开源...

SFT代码开源

地址：https://github.com/xubuvd/PhasedSFT
代码在16台A100、A800和H100机器上，经受过了大规模的SFT训练，数据量最多达 200万条，已经训练过的模型有Llama1、Llama2、Llama3、Qwen1、Qwen1.5、Qwen2 和 Mistral，稳定可用的。

偏好训练 Post-Training with Direct Preference Optimization

地址：codes_datasets/Postraining_dpo
用法参考：codes_datasets/Postraining_dpo/README.md

代码在16台A100、A800和H100机器上训练过，DPO偏好数据最多到50万条，稳定可用。

预训练数据清洗源代码和高并发框架

大规模预训练数据集数据集清洗，具有可读性好、可用性好，Python多线程并发，包括一套清洗工程规范，清洗策略，多线程框架，文本去重代码，黑名单词典等内容。
2024.08.13，开源出来一部分基于启发式规则的多线程python清洗代码，代码复用性、可用性好：codes_datasets/DataCleaning
清洗代码唯一的启动脚本：

nohup bash run_data_cleaning.sh > r.log 2>&1 &

停止程序的脚本：

bash stopall.sh

SFT 数据集清洗和数据质量提升代码框架

🍂step1: 数据清洗，也可复用预训练数据清洗代码
🍂step2: 数据质量提升
1.指令质量打分
2.指令难度打分
3.聚类+语义去重复
代码整理中，等待开源...

中文大语言模型的安全对齐工作，用于安全审核评估

一千条安全类指令数据集:codes_datasets/SFTData/cn-Safety-Prompts-gpt12k_baichuan1K.jsonl
一千条安全类中文指令数据集,关于诈骗、欺骗、个人安全、个人攻击性歧视性仇恨言论、黄赌毒等类型: codes_datasets/SFTData/cn-Chinese-harmlessness-1K.jsonl

开源sft数据集构造

1. 大规模COT高中试题数据集，命名为“cn-sft-exams-highSchool-1M.jsonl”

开源超大规模高中试题指令数据集，100万条中文指令数据，涵盖语文、数学、物理、化学、地理、历史、政治和英文。
已上传：cn-sft-exams-highSchool-1k.jsonl
指令格式：{"id": "26069", "data": ["问题","答案"]}

2. SFT数据质量优化

数据质量和多样性，这两条是模型能力的来源。
优化前：

优化后：
上述数据分布图生成代码：codes_datasets/utils/tsne_cluster_show.py

2. 行政职业能力测验题数据集，命名为“cn-sft-CS-APAT-30K.jsonl”

共 3 万条行测试题，逻辑推理题目，旨在提升模型的逻辑能力。
已上传：cn-exam-high-school-5W.jsonl.zip
指令格式：{"id": "26069", "data": ["问题","答案"]}

落地大语言模型LLM，关键问题是在某个垂直领域，如何构造高质量指令数据集？有方法论吗？

有的！

从0到1预训练大语言模型

预训练框架：基于 DeepSpeed + HuggFace Trainer 研发框架
模型结构： LLaMA；

论文引用请参考

@online{CnSftData,
  author = {XuBu},
  year = {2024},
  title = {Chinese Large Language Models},
  url = {https://github.com/xubuvd/LLMs},
  month = {APR},
  lastaccessed = {APR 19, 2024}
}

sft微调训练和强化学习训练

SFT：使用上一步的预训练框架 (弃用DeepSpeedChat，因为它不支持大规模数据训练，存在很多问题)
RLHF框架：使用优化过的 DeepSpeedChat 进行训练

DPO 偏好数据对齐

DPO非常有效，目前在3万偏好数据集上，测试13B、70B和7B模型，效果非常显著；
DPO版本相对于SFT模型，胜率提升了「10个」百分点以上，人的观感收益也很显著。具体的，在业务上的准确率（accuracy），相对于SFT模型，提升了「14.6」个百分点，F1值，提升了「13.5」个百分点。

DPO偏序数据集构造

DPO训练，关键指标图示(一部分)：

预训练数据收集和清洗

数据收集和清洗，对大语言模型（LLM）最终效果的影响极度重要。
数据清洗需要一套方法论，预训练数据的三项关键指标：质量高、多样性和数量大。

什么是质量高？

一大段自然语言文本，语法上连贯流畅，没有插入无关的词汇、句子，语义上完整；
标点符号符合行文规范；

什么是多样性？

覆盖各个通用领域和学科的知识
构建一个全面的「行业领域知识体系」
整合上述各类信息源，确保广泛的知识覆盖

什么是数量大？

总量
- Pretrain 数据总量大，以LLaMA tokens计算，约 1~2 T tokens 左右
分量
- 从数据多样性上看，各种数据类型的数据都有，大小基本符合互联网上的数据自然分布

交流群

如果你也对本项目感兴趣，欢迎加入群聊参与讨论交流。

基于DeepSpeedChat改造，用于RLHF训练的框架

No.	Bug	原做法	修改	注评
1	Loss计算方式	所有tokens的预测损失（CE loss）	只计算模型respnse部分的预测损失	计算所有token的loss，效果不好，只计算模型response的loss，其它部分mask掉
2	新增pre-train和 SFT两种损失Loss计算	只有loss计算一种	增加pre-train预训练	支持SFT和Pre-train混合训练，同一个batch内部有两类数据
3	不作为一个特殊字符	作为一个文本序列	使用特殊字符代替，不需要新加一个	参考论文“A General Language Assistant as a Laboratory for Alignment”，用作特殊字符效果好一些。
4	subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.	g++ wasn't installed.	#apt-get install build-essential	g++环境问题
5	wandb.errors.UsageError: api_key not configured (no-tty).		#wandb login 根据提示获取api key注册一下即可	wandb使用问题，退出后再进入要：$ wandb login --relogin
6	Calling torch.distributed.barrier() results in the program being killed	#df -lh	#rm -f /dev/shm/nccl-*	Docker容器共享内存太小存满导致，容器里跑训练会遇到， Slurm集群里，面对裸机没有此类问题。
7	huggingface/tokenizers: The current process just got forked, after parallelism has already been used.			warning，暂不处理
8	数据集索引大小的bug			2982929829一个不可能出现的数字， index缓存文件名字名字重复，加入子进程的 global rank, loacl rank命名，已解决。
9	wandb: ERROR Run initialization has timed out after 60.0 sec.		两个可能原因： 1，某些node的网络没有打开导致的； 2，节点的网络中断；上述两个原因都遇到过。	排查两个原因
10	OSError: [Errno 122] Disk quota exceeded	模型文件checkpoint写到管理节点本地，仅保存了4个checkpoints，空间就🈵️了， pytorch_model_10.bin, pytorch_model_20.bin, pytorch_model_30.bin, pytorch_model_40.bin	1. checkpoints先保存在/hpc_data/pangwei/ 【因为写权限问题，先保存该目录下】，速度变慢，10分钟加载模型文件； 2. 保留当前三个checkpoints； 3. 保存历史上最好的一个checkpoint，根据验证集上的perplexity指标。checkpoints分为三种，后缀分别为：norm_{steps}, bestppl_{steps}, final_{steps}。	磁盘配额不够了，磁盘已满或超出了用户所能使用的配额上限
11	混合训，支持任意多个训练数据文件	支持一类数据集读取	支持四类不同数据集，每一类可以任意多： --train_pt_data_path [] --eval_pt_data_path [] --train_sft_data_path [] --eval_sft_data_path [] 预训练数据集，后缀：训练集pt_train.jsonl, 验证集 pt_eval.jsonl; 指令微调数据集，后缀：训练集 sft_train.jsonl, 验证集 sft_eval.jsonl。	支持混合训的数据集管理，便于不同数据集的配比
12	resume问题		1）保存 checkpoint 元信息，包括 epoch, global step, optimizer, checkpoints file name； 2）resume 继续训练，断点重新训练。	加载当前最新的一个checkpoint；
13	(ReqNodeNotAvail, Un) scancel一个任务又重新启动会遇到此类错误		slurm系统scancel任务后挂掉	重启slurm吧
14	缓存空间溢满 OSError: [Errno 28] No space left on device: '/tmp/data_files'		从/tmp/目录调整到/data/XXX/目录
15	Save checkpoints，按照固定steps计算perplexity，保存最优模型	每个epoch结束后才计算perplexity	增加一个参数 args.eval_save_steps，默认100
16	Save checkpoint 并行化	checkpoint 路径全局唯一，如果在多个节点（gnode）上启动任务，输出路径重合，互相干扰输出结果	Save 路径上新增一个节点名字，可以同时多机启动多个一样的任务。路径例子：/hpc_data/XXX/ actor-models/chinese_llama_plus-gnode07-13b-gnode07-20230524-0356。	chinese_llama_plus-gnode07-13b-gnode07-20230524-0356：模型名字-模型大小-节点名字-时间串
17	加入wandb			训练参数，资源和指标可视化 f"{model_name}-{JOB_ID}"
18	加入TGS,TFlops			卡吞吐率
19	统计预训练和指令微调数据集的总tokens数量			60W条SFT数据集 Total tokens for pre-training: 0 Total tokens for sft: 51166867 Total tokens: 51166867
20	自动记录训练的超参数 Copy run_*.sh到输出目录下		copy2(script, f'{output_dir}/run.sh')	模型的超参数与模型checkpoint文件保存在一起，便于分析模型性能与参数的关系
21	RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using ulimit -n in the shell or change the sharing strategy by calling torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of your code		torch.multiprocessing. set_sharing_strategy ('file_system')	快速文件存储系统的设置存在问题
22	缺少多机多卡支持			deepspeed-chat没有写这部分，增加进行中
23	torch.cuda.OutOfMemoryError: CUDA out of memory.	Chinese-LLaMA-13B的模型训练遇到OOM问题		在epoch 循环的内部，进行了 evaluation()， evaluation设置了model.eval()模式, 但是退出evaluation再次进入 epoch 循环时，没有设置model.train()模式。
24	A100-PCIe主板接口网速瓶颈			训练正向进行和反向进行， GPU达到峰值，多卡之间信息同步时， GPU利用率跌倒谷底。 NV-link的接口GPU平稳。
25	Index cache 文件名字重合	下一个文件使用了前一个文件的索引 d_path:f_identity_qa_cn_re.json, train_dataset_size:198991, eval_dataset:1009 d_path:f_multiturn_cn_69k.json, train_dataset_size:69318, eval_dataset:336 d_path:f_ver_qa_cn_28k.json, train_dataset_size:28140, eval_dataset:166	索引使用基座模型名字，训练数据文件名字区别开
26	单条数据不补齐到 max_seq_len, batch内部数据补齐到batch内最大长度	batch内所有数据补齐到max_seq_len		提高计算效率，大多数长度相对很短
27	Moss-13B loss异常			重新设置lr, warmup...
27	Moss-13B loss异常		重新设置lr, warmup...
28	GPU 卡住			偶发性卡住，多机多卡常遇到的事读数据阶段，0号GPU卡住不动.
29	OSError: [Errno 122] Disk quota exceeded	root@master:~# quota -uvs user_name Disk quotas for user user_name (uid 1006): Filesystem space quota limit grace files quota limit grace /dev/sda1 2862G* 2852G 2862G 6days 96582 2900k 3000k		快速存储系统写满加系统监控，写到80%提前预警
30	读数据，机器之间速度差别较大	1. gnode03机器： @master:/$ tail -f training.log 3% ▎ 32285/1087101 [01:29<48:05, 365.57it/s] 2.gnode04机器： @master:/$ tail -f training.log 6% ▌ 63310/1087101 [02:53<45:11, 377.51it/s] $3.gnode06机器: @master:~/$ tail -f training.log 19% █▉ 211851/1087101 [03:36<15:01, 970.99it/s]	读取一个108万条数据的文件，有的机器节点耗时50分钟；有的机器节点耗时18分钟。	快速文件系统配置存在问题
32	-	-	-	效果优化
33	-	-	-	效果优化
34	-	-	-	效果优化
35	-	-	-	效果优化
36	启动训练时不提供具体数据文件，只提供数据集目录，自动读取目录下的所有数据文件	提供每一个训练数据文件名字	提供训练数据集目录	预训练数据是由很多小文件组成的，不方便在启动脚本里加入许多文件名字
37	load_dataset不能读大文件	pyarrow.lib.ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 2572789185	内存映射，流式读取	from datasets import load_dataset
38	数据量超出某个临界点OOM， 1T内存都被爆满		加载数据之前，内存消耗： 1-ToatlMem:1007, UsedMem:211, FreeMem:768 加载数据之后，内存消耗： 2-ToatlMem:1007, UsedMem:898, FreeMem:42 此时，空闲内存42G，开始进入 deepspeed.initialize（），使用的是ZeRO 2优化	原模型训练流程拆分成两部分： 1，生成内存映射文件（MemapGen）：加载原始数据集，tokenizer后padding到max_seq_len大小，写入内存映射文件；内存映射文件包括三个，分别是input_ids文件，attention_mask文件和 labels文件，还有一个config json配置文件。 2，模型训练加载config文件和三个内存映射文件，跑起来... 优化点：组装batch时，随机读取整个映射文件，速度巨慢；MemapGen时随机shuffle好文件，加载后按顺序组装batch。
39	OSError: [Errno 122] Disk quota exceeded	$ quota -uvs user_name	空间满，所有训练都会停止	增加硬盘存储监控，空间使用85%时提前预警，主动发现隐患点

iDeepSpeedChat 实训可视化图

Open-source of LLMs

No.	Projects	URL	Comments
1	LLaMA	https://github.com/facebookresearch/llama	模型结构transformer block:
2	OpenChatKit	https://github.com/togethercomputer/OpenChatKit	基于GPT-NeoX-20B的微调版本，200亿参数，48层，单机八卡，每卡六层网络，每一层的模型结构：
3	Open-Assistant	https://github.com/LAION-AI/Open-Assistant	12B或者LLAMA-7B两个版本，Open Assistant 全流程训练细节（GPT3+RL）,https://zhuanlan.zhihu.com/p/609003237
4	ChatGLM-6B	https://github.com/THUDM/ChatGLM-6B	ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型，基于 General Language Model (GLM) 架构，具有 62 亿参数。可以作为很好的基础模型，在此之上做二次研发，在特定垂直领域。没有放出源代码，只有训练好的模型。
5	GLM-130B	https://github.com/THUDM/GLM-130B/	1300亿参数的中/英文大模型，没有放出源代码，只有训练好的模型
6	Alpaca 7B	https://crfm.stanford.edu/2023/03/13/alpaca.html	A Strong Open-Source Instruction-Following Model，a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations.
7	Claude	用户可以通过邮箱等信息注册申请试用, 产品地址：https://www.anthropic.com/product, 申请地址：https://www.anthropic.com/earlyaccess, API说明: https://console.anthropic.com/docs/api	两个版本的 Claude：Claude 和 Claude Instant。 Claude 是最先进的高性能模型，而 Claude Instant 是更轻、更便宜、更快的选择。
8	LLama/ChatLLama	https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama	中文支持不好，有全套的SFT，RLHF训练过程
9	chatglm-6B_finetuning	https://github.com/ssbuild/chatglm_finetuning	1,chatGLM-6B的微调版本，正在补充RLHF代码，陆续放出来；28层网络，每一层的模型结构：; 2，两种微调方式：LoRA微调和SFT微调，28层网络，指令数据5K，单机8卡，A100，80G显存，batch size 8, epoch 1或2（有生成重复问题），大约20分钟内完成； 3, 借鉴 Colossal-AI/Open-Assistant的强化学习代码（PPO，PPO-ptx算法），Colossal-AI可以迁移过来，被实践过。 4，Reward model，可选较多，直接基于GLM-6B模型微调一个Reward model。难点就是训练数据；GPT3.5使用了33K的人工标注数据训练 Reword model。每个问题，配置四个答案ABCD，人工从好到差排序比如B>A>D>C，排序后的数据微调Reward model。单机8卡，A100， 80G，train_batch_size=4, max_seq_len设置成512，才可以跑50K级别的微调数据集，这份代码感觉有点疑问，需要优化的地方挺多的
10	ChatGLM-Tuning	https://github.com/mymusise/ChatGLM-Tuning	ChatGLM-6B的又一个微调版本
11	中文语言模型骆驼 (Luotuo)	https://github.com/LC1332/Chinese-alpaca-lora	基于 LLaMA、Stanford Alpaca、Alpaca LoRA、Japanese-Alpaca-LoRA 等完成，单卡就能完成训练部署
12	Alpaca-COT数据集	https://github.com/PhoebusSi/Alpaca-CoT	思维链（CoT）数据集，增强大语言模型的推理能力
13	Bloom	https://huggingface.co/bigscience/bloom	训练和代码比较全
14	中文LLaMA&Alpaca大语言模型	https://github.com/ymcui/Chinese-LLaMA-Alpaca	在原版LLaMA（7B和13B）的基础上扩充了中文词表并使用了中文数据进行二次预训练，进一步提升了中文基础语义理解能力。同时，在中文LLaMA的基础上，本项目使用了中文指令数据进行指令精调，显著提升了模型对指令的理解和执行能力。
15	Colossal-AI/ColossalChat	https://github.com/hpcaitech/ColossalAI	训练和代码比较全，包括 RLHF 训练代码；以 LLaMA 为基础预训练模型；开源了7B和13B两种模型；
16	Cerebras-GPT七个版本	官网地址：https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models GPT地址:https://www.cerebras.net/cerebras-gpt Hugging Face地址:https://huggingface.co/cerebras	七个参数版本：1.16亿、2.56亿、5.9亿、13亿、27亿、67亿和130亿参数, 基于GPT的生成人工智能大型语言模型
17	BloombergGPT (金融领域)	https://arxiv.org/abs/2303.17564	BloombergGPT的训练数据库名为FINPILE，构建迄今为止最大的特定领域数据集, 由一系列英文金融信息组成，包括新闻、文件、新闻稿、网络爬取的金融文件以及提取到的社交媒体消息。训练专门用于金融领域的LLM，拥有500亿参数的语言模型。
18	dolly-v1-6b	https://github.com/databrickslabs/dolly	1, fine-tuned on a ~52K instruction (Self-Instruct从 ChatGPT自动获取)； 2，deepspeed ZeRo 3加速训练; 3.可借鉴的：deepspeed ZeRo 3加速训练部分；
19	ChatDoctor	https://github.com/Kent0n-Li/ChatDoctor	医疗领域对话模型，基于LLaMA-7B微调的大模型，经过四轮微调：第一轮微调：羊驼的52K instruction-following 数据 ;第二轮微调：患者和医生之间的5K对话数据集（ChatGPT GenMedGPT-5k和疾病数据库生成）；第三轮微调：患者和医生之间的真实对话（HealthCareMagic-200k）；第四轮微调：患者和医生之间的真实对话（icliniq-26k）.
20	开源中文对话大模型BELLE	https://github.com/LianjiaTech/BELLE	BELLE-7B（基于 BLOOMZ-7B1-mt 微调） BELLE-13B的感觉还行
21	InstructGLM	https://github.com/yanqiangmiffy/InstructGLM	基于ChatGLM-6B+LoRA在指令数据集上进行微调；截止4月4号下午，InstructGLM存在以下缺点：多卡不支持，原作者在回答issues时也确认了；社区不活跃，两周不更新代码，坑反馈的太少（才11条），deepspeed没有；这是我用三块卡跑，卡负载不均衡：
22	Cerebras-GPT	https://huggingface.co/cerebras/Cerebras-GPT-13B	参数量级130亿，大小比肩最近Meta开放的LLaMA-13B，数据集、模型权重和计算优化训练，全部开源。可商用！
23	Baize (加利福尼亚大学, 基于 LLaMA 的微调)	https://github.com/project-baize/baize-chatbot	数据集生成: 让 ChatGPT 与自己进行对话，模拟用户和AI机器人的回复」。这个生成的语料数据集是在多轮对话的背景下训练和评估聊天模型的宝贵资源。此外，通过指定种子数据集，可以从特定领域进行采样，并微调聊天模型以专门针对特定领域，例如医疗保健或金融。 Parameter-efficient tuning，输入序列的最大长度设置为512，LoRA中的秩k设置为8，使用8位整数格式 (int8) ，Adam 优化器」更新LoRA 参数，batch size为64，learning rate为2e-4、1e-4和 5e-5，可训练的LoRA参数在 NVIDIA A100-80GB GPU 上微调了1个 epoch。
24	Open-Llama	https://github.com/s-JoL/Open-Llama	Open-Llama是一个开源项目，提供了一整套用于构建大型语言模型的训练流程，从数据集准备到分词、预训练、指令调优，以及强化学习技术 RLHF。采用FastChat项目相同方法测评Open-Llama的效果和GPT3.5的效果对比，经过测试在中文问题上可以达到GPT3.5 84%的水平。
25	DeepSpeed-Chat	https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat https://github.com/microsoft/DeepSpeedExamples	微调框架：包括指令微调（SFT），Reward model 和强化学习对齐意图（RLHF）
26	fairseq	https://github.com/facebookresearch/fairseq	FaceBook开源的大语言模型预训练框架
27	metaseq	https://github.com/facebookresearch/metaseq	FaceBook开源的大语言模型预训练模型框架，基于fairseq的新版本
28	MiniGPT-4	https://github.com/Vision-CAIR/MiniGPT-4	多模态大模型，基于 BLIP-2 和 Vicuna（LLaMA-7B基座）, 阿卜杜拉国王科技大学
29	moss (复旦大学)	https://github.com/OpenLMLab/MOSS https://huggingface.co/models?other=moss	moss-13B开源了，重要贡献是提供了一个纯基座
30	红睡衣(RedPajama)开源计划	https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T 预处理仓库:https://github.com/togethercomputer/RedPajama-Data	红睡衣开源计划总共包括三部分： 1. 高质量、大规模、高覆盖度的预训练数据集； 2. 在预训练数据集上训练出的基础模型； 3. 指令调优数据集和模型，比基本模型更安全、可靠。 Ontocord.AI，苏黎世联邦理工学院DS3Lab，斯坦福CRFM，斯坦福Hazy Research 和蒙特利尔学习算法研究所的开源计划，旨在生成可复现、完全开放、最先进的语言模型，即从零一直开源到ChatGPT！。
31	Panda 中文开源大语言模型	https://github.com/dandelionsllm/pandallm	基于Llama-7B、-13B、-33B和-65B进行了中文领域的持续预训练，在中文基准测试中表现优异，远超同等类型的中文语言模型，Panda的模型和训练所用中文数据集将以开源形式发布，任何人都可以免费使用和参与开发。
32	BELLE （LLaMA，链家）	https://github.com/LianjiaTech/BELLE	BELLE-LLaMA-EXT-13B，在LLaMA-13B的基础上扩展中文词表，并在400万高质量的对话数据上进行训练。
33	Linly-Chinese-LLaMA	https://github.com/CVI-SZU/Linly	LLaMA-7B/13B基础上，中文二次预训练，上下文长度2048
34	CAMEL	https://github.com/camel-ai/camel	https://www.camel-ai.org/chat
35	Ziya-LLaMA-13B-v1	https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1	姜子牙通用大模型V1是基于LLaMa的130亿参数的大规模预训练模型
36	悟道·天鹰（Aquila）	https://github.com/FlagAI-Open/FlagAI/tree/master/examples/Aquila/Aquila-chat

ColossalAI 的性能测试

1， ZeRO 2的性能，tflops 约为251

2, ZeRO 2和3的性能对比

Model	ZeRO	GPU数量	Bs	Seq len	Gpu mem	Gpu Usage	Iter	Tflops	TGS (tokens per gpu per second)
LLaMA-7B	zero2	2	50	2048	90%	60%	25s	250	4096
LLaMA-7B	zero3	2	76	2048	97%	80%	30s	300	5188

3, 黄色曲线是ZeRO3，绿色曲线是ZeRO2

北京邮电大学王小捷教授 ChatGPT 讲座分享

https://www.bilibili.com/video/BV1G24y187yx/?buvid=ZB476BB0B8710E3C4F548C7C2778AA1427C6&is_story_h5=false&mid=AdBmq4Rn7y73B2EmgVj16A%3D%3D&p=1&plat_id=114&share_from=ugc&share_medium=iphone&share_plat=ios&share_session_id=5BB03E0F-3FED-48AF-A5FE-7F3E52513D99&share_source=WEIXIN&share_tag=s_i&timestamp=1677718075&unique_k=lk400UP&up_id=354740423

致谢

For Tasks:

Click tags to check more tools for each tasks

train large language models clean pre-training data improve data quality align security work construct sft datasets

For Jobs:

data scientist machine learning engineer ai researcher nlp specialist data engineer

Alternative AI tools for LLMs

Similar Open Source Tools

LLMs

github

: 97

awesome-ai-painting

This repository, named 'awesome-ai-painting', is a comprehensive collection of resources related to AI painting. It is curated by a user named 秋风, who is an AI painting enthusiast with a background in the AIGC industry. The repository aims to help more people learn AI painting and also documents the user's goal of creating 100 AI products, with current progress at 4/100. The repository includes information on various AI painting products, tutorials, tools, and models, providing a valuable resource for individuals interested in AI painting and related technologies.

github

: 11.0k

Firefly

Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.

github

: 4.8k

Awesome-Interpretability-in-Large-Language-Models

This repository is a collection of resources focused on interpretability in large language models (LLMs). It aims to help beginners get started in the area and keep researchers updated on the latest progress. It includes libraries, blogs, tutorials, forums, tools, programs, papers, and more related to interpretability in LLMs.

github

: 230

oumi

Oumi is an open-source platform for building state-of-the-art foundation models, offering tools for data preparation, training, evaluation, and deployment. It supports training and fine-tuning models with various parameters, working with text and multimodal models, synthesizing and curating training data, deploying models efficiently, evaluating models comprehensively, and running on different platforms. Oumi provides a consistent API, reliability, and flexibility for research purposes.

github

: 8.5k

chat-master

ChatMASTER is a self-built backend conversation service based on AI large model APIs, supporting synchronous and streaming responses with perfect printer effects. It supports switching between mainstream models such as DeepSeek, Kimi, Doubao, OpenAI, Claude3, Yiyan, Tongyi, Xinghuo, ChatGLM, Shusheng, and more. It also supports loading local models and knowledge bases using Ollama and Langchain, as well as online API interfaces like Coze and Gitee AI. The project includes Java server-side, web-side, mobile-side, and management background configuration. It provides various assistant types for prompt output and allows creating custom assistant templates in the management background. The project uses technologies like Spring Boot, Spring Security + JWT, Mybatis-Plus, Lombok, Mysql & Redis, with easy-to-understand code and comprehensive permission control using JWT authentication system for multi-terminal support.

github

: 99

Awesome-Tabular-LLMs

This repository is a collection of papers on Tabular Large Language Models (LLMs) specialized for processing tabular data. It includes surveys, models, and applications related to table understanding tasks such as Table Question Answering, Table-to-Text, Text-to-SQL, and more. The repository categorizes the papers based on key ideas and provides insights into the advancements in using LLMs for processing diverse tables and fulfilling various tabular tasks based on natural language instructions.

github

: 151

Tutorial

The Bookworm·Puyu large model training camp aims to promote the implementation of large models in more industries and provide developers with a more efficient platform for learning the development and application of large models. Within two weeks, you will learn the entire process of fine-tuning, deploying, and evaluating large models.

github

: 1.6k

inference-speed-tests

This repository contains inference speed tests on Local Large Language Models on various devices. It provides results for different models tested on Macbook Pro and Mac Studio. Users can contribute their own results by running models with the provided prompt and adding the tokens-per-second output. Note that the results are not verified.

github

: 54

InternVL

InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM. It is a vision-language foundation model that can perform various tasks, including: **Visual Perception** - Linear-Probe Image Classification - Semantic Segmentation - Zero-Shot Image Classification - Multilingual Zero-Shot Image Classification - Zero-Shot Video Classification **Cross-Modal Retrieval** - English Zero-Shot Image-Text Retrieval - Chinese Zero-Shot Image-Text Retrieval - Multilingual Zero-Shot Image-Text Retrieval on XTD **Multimodal Dialogue** - Zero-Shot Image Captioning - Multimodal Benchmarks with Frozen LLM - Multimodal Benchmarks with Trainable LLM - Tiny LVLM InternVL has been shown to achieve state-of-the-art results on a variety of benchmarks. For example, on the MMMU image classification benchmark, InternVL achieves a top-1 accuracy of 51.6%, which is higher than GPT-4V and Gemini Pro. On the DocVQA question answering benchmark, InternVL achieves a score of 82.2%, which is also higher than GPT-4V and Gemini Pro. InternVL is open-sourced and available on Hugging Face. It can be used for a variety of applications, including image classification, object detection, semantic segmentation, image captioning, and question answering.

github

: 6.5k

AstrBot

AstrBot is a powerful and versatile tool that leverages the capabilities of large language models (LLMs) like GPT-3, GPT-3.5, and GPT-4 to enhance communication and automate tasks. It seamlessly integrates with popular messaging platforms such as QQ, QQ Channel, and Telegram, enabling users to harness the power of AI within their daily conversations and workflows.

github

: 6.6k

TRACE

TRACE is a temporal grounding video model that utilizes causal event modeling to capture videos' inherent structure. It presents a task-interleaved video LLM model tailored for sequential encoding/decoding of timestamps, salient scores, and textual captions. The project includes various model checkpoints for different stages and fine-tuning on specific datasets. It provides evaluation codes for different tasks like VTG, MVBench, and VideoMME. The repository also offers annotation files and links to raw videos preparation projects. Users can train the model on different tasks and evaluate the performance based on metrics like CIDER, METEOR, SODA_c, F1, mAP, Hit@1, etc. TRACE has been enhanced with trace-retrieval and trace-uni models, showing improved performance on dense video captioning and general video understanding tasks.

github

: 54

2024-free-good-airport-vpn

github

: 66

MMOS

MMOS (Mix of Minimal Optimal Sets) is a dataset designed for math reasoning tasks, offering higher performance and lower construction costs. It includes various models and data subsets for tasks like arithmetic reasoning and math word problem solving. The dataset is used to identify minimal optimal sets through reasoning paths and statistical analysis, with a focus on QA-pairs generated from open-source datasets. MMOS also provides an auto problem generator for testing model robustness and scripts for training and inference.

github

: 61

Awesome-RL-for-LRMs

This repository contains a collection of awesome resources for reinforcement learning in language models. It includes tutorials, code implementations, research papers, and tools to help researchers and practitioners explore and apply reinforcement learning techniques in natural language processing tasks. Whether you are a beginner or an expert in the field, this repository aims to provide valuable insights and guidance to enhance your understanding and implementation of reinforcement learning in language models.

github

: 1.4k

phoenix

Phoenix is a tool that provides MLOps and LLMOps insights at lightning speed with zero-config observability. It offers a notebook-first experience for monitoring models and LLM Applications by providing LLM Traces, LLM Evals, Embedding Analysis, RAG Analysis, and Structured Data Analysis. Users can trace through the execution of LLM Applications, evaluate generative models, explore embedding point-clouds, visualize generative application's search and retrieval process, and statistically analyze structured data. Phoenix is designed to help users troubleshoot problems related to retrieval, tool execution, relevance, toxicity, drift, and performance degradation.

github

: 7.1k

For similar tasks

LLMs

github

: 97

Reflection_Tuning

Reflection-Tuning is a project focused on improving the quality of instruction-tuning data through a reflection-based method. It introduces Selective Reflection-Tuning, where the student model can decide whether to accept the improvements made by the teacher model. The project aims to generate high-quality instruction-response pairs by defining specific criteria for the oracle model to follow and respond to. It also evaluates the efficacy and relevance of instruction-response pairs using the r-IFD metric. The project provides code for reflection and selection processes, along with data and model weights for both V1 and V2 methods.

github

: 124

ProX

ProX is a lm-based data refinement framework that automates the process of cleaning and improving data used in pre-training large language models. It offers better performance, domain flexibility, efficiency, and cost-effectiveness compared to traditional methods. The framework has been shown to improve model performance by over 2% and boost accuracy by up to 20% in tasks like math. ProX is designed to refine data at scale without the need for manual adjustments, making it a valuable tool for data preprocessing in natural language processing tasks.

github

: 164

ml-engineering

This repository provides a comprehensive collection of methodologies, tools, and step-by-step instructions for successful training of large language models (LLMs) and multi-modal models. It is a technical resource suitable for LLM/VLM training engineers and operators, containing numerous scripts and copy-n-paste commands to facilitate quick problem-solving. The repository is an ongoing compilation of the author's experiences training BLOOM-176B and IDEFICS-80B models, and currently focuses on the development and training of Retrieval Augmented Generation (RAG) models at Contextual.AI. The content is organized into six parts: Insights, Hardware, Orchestration, Training, Development, and Miscellaneous. It includes key comparison tables for high-end accelerators and networks, as well as shortcuts to frequently needed tools and guides. The repository is open to contributions and discussions, and is licensed under Attribution-ShareAlike 4.0 International.

github

: 15.1k

distributed-llama

Distributed Llama is a tool that allows you to run large language models (LLMs) on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. It uses TCP sockets to synchronize the state of the neural network, and you can easily configure your AI cluster by using a home router. Distributed Llama supports models such as Llama 2 (7B, 13B, 70B) chat and non-chat versions, Llama 3, and Grok-1 (314B).

github

: 2.7k

Awesome-LLMs-for-Video-Understanding

Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.

github

: 1.8k

Awesome-LLM

Awesome-LLM is a curated list of resources related to large language models, focusing on papers, projects, frameworks, tools, tutorials, courses, opinions, and other useful resources in the field. It covers trending LLM projects, milestone papers, other papers, open LLM projects, LLM training frameworks, LLM evaluation frameworks, tools for deploying LLM, prompting libraries & tools, tutorials, courses, books, and opinions. The repository provides a comprehensive overview of the latest advancements and resources in the field of large language models.

github

: 22.1k

MotionLLM

MotionLLM is a framework for human behavior understanding that leverages Large Language Models (LLMs) to jointly model videos and motion sequences. It provides a unified training strategy, dataset MoVid, and MoVid-Bench for evaluating human behavior comprehension. The framework excels in captioning, spatial-temporal comprehension, and reasoning abilities.

github

: 212

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 980

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.9k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 32.1k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675

LLMs

README:

中文领域大语言模型实操技术全栈

Scaling Law for SFT & DPO

领域LLMs的数据模型 for SFT & DPO

领域LLMs的有效训练方案 for SFT & DPO

预训练、SFT和DPO偏好对齐代码框架

SFT代码开源

偏好训练 Post-Training with Direct Preference Optimization

预训练数据清洗源代码和高并发框架

SFT 数据集清洗和数据质量提升代码框架

中文大语言模型的安全对齐工作，用于安全审核评估

开源sft数据集构造

1. 大规模COT高中试题数据集，命名为“cn-sft-exams-highSchool-1M.jsonl”

2. SFT数据质量优化

2. 行政职业能力测验题数据集，命名为“cn-sft-CS-APAT-30K.jsonl”

落地大语言模型LLM，关键问题是在某个垂直领域，如何构造高质量指令数据集？有方法论吗？

从0到1预训练大语言模型

论文引用请参考

sft微调训练和强化学习训练

DPO 偏好数据对齐

DPO偏序数据集构造

DPO训练，关键指标图示(一部分)：

预训练数据收集和清洗

什么是质量高？

什么是多样性？

什么是数量大？

交流群

基于DeepSpeedChat改造，用于RLHF训练的框架

iDeepSpeedChat 实训可视化图

Open-source of LLMs

ColossalAI 的性能测试

北京邮电大学 王小捷教授 ChatGPT 讲座分享

致谢

For Tasks:

For Jobs:

Alternative AI tools for LLMs

Similar Open Source Tools

LLMs

awesome-ai-painting

Firefly

Awesome-Interpretability-in-Large-Language-Models

oumi

chat-master

Awesome-Tabular-LLMs

Tutorial

inference-speed-tests

InternVL

AstrBot

TRACE

2024-free-good-airport-vpn

MMOS

Awesome-RL-for-LRMs

phoenix

For similar tasks

LLMs

Reflection_Tuning

ProX

ml-engineering

distributed-llama

Awesome-LLMs-for-Video-Understanding

Awesome-LLM

MotionLLM

For similar jobs

weave

LLMStack

VisionCraft

kaito

PyRIT

tabby

spear

Magick

北京邮电大学王小捷教授 ChatGPT 讲座分享