Steel-LLM
Train a Chinese LLM From 0 by Personal
Stars: 144
Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.
README:
Steel-LLM是一个从零开始预训练中文大模型的项目。我们的目标是使用1T+的数据预训练一个1B左右参数量的中文LLM,对标TinyLlama。项目持续更新,维持3个月+。我们会分享数据收集、数据处理、预训练框架选择、模型设计等全过程,并开源全部代码。让每个人在有8~几十张卡的情况下都能复现我们的工作。
🤗 Hugging Face      📑 Blog
"Steel(钢)"取名灵感来源于华北平原一只优秀的乐队“万能青年旅店(万青)”。乐队在做一专的时候条件有限,自称是在“土法炼钢”,但却是一张神专。我们训练LLM的条件同样有限,但也希望能炼出好“钢”来。
[2024/9/2] HuggingFace更新了480k、660k、720k、980k、1060k(最后一个checkpoint)step的checkpoint。
[2024/8/18] 预训练已经完成,后续进行微调以及评测
[2024/7/18] 使用8*H800继续训练,wandb:https://wandb.ai/steel-llm-lab/lightning_logs/reports/Untitled-Report--Vmlldzo4NzI1MTQz
[2024/6/30] 放出预训练200k个step的checkpoint,huggingface链接
[2024/5/21] 模型开启正式训练,后续不定期放出checkpoint。
[2024/5/19] 基于Qwen1.5完成模型修改,模型大小1.12B:
- FFN层使用softmax moe,相同参数量下有更高的训练速度
- 使用双层的SwiGLU
相关博客:https://zhuanlan.zhihu.com/p/700395878
[2024/5/5] 预训练程序修改相关的博客:https://zhuanlan.zhihu.com/p/694223107
[2024/4/24] 完成训练程序改进:兼容Hugginface格式模型、支持数据断点续训、支持追加新的数据
[2024/4/14] 完成数据收集与处理,生成预训练程序所需要的bin文件。更新数据收集与处理相关的博客:https://zhuanlan.zhihu.com/p/687338497
zhanshijin的知乎:https://www.zhihu.com/people/zhan-shi-jin-27
lishu14的知乎:https://www.zhihu.com/people/a-xun-58-5
使用的数据集和链接如下所示,更详细的介绍请看此篇文章
- Skywork/Skypile-150B数据集
- wanjuan1.0(nlp部分)
- 中文维基过滤数据
- 百度百科数据
- 百度百科问答数据
- 知乎问答数据
- BELLE对话数据
- moss项目对话数据
- firefly1.1M
- starcoder
- 源数据:针对三类数据进行格式统一的转化处理。
- 简单文本:百度百科(title和各段落需要手动合并)、中文维基
- 对话(含单轮与多轮):百度百科问答数据、BELLE对话数据(BELLE_3_5M)、moss项目对话数据、知乎问答数据
- 任务:BELLE任务数据(BELLE_2_5M)、firefly1.1M
- 代码数据:starcode
- 目标格式:
{"text": "asdfasdf..."}
,文件保存为.jsonl类型。 - 运行方式:
python data/pretrain_data_prepare/step1_data_process.py
我们使用data-juicer处理文本时,不改变数据格式。
-
运行方式:
sh data/pretrain_data_prepare/step2/run_step2.sh
-
选用的文本处理算子
算子 | 描述 |
---|---|
chinese_convert_mapper | 用于在繁体中文、简体中文和日文汉字之间进行转换(借助 opencc) |
clean_email_mapper | 删除邮箱信息 |
clean_html_mapper | 删除 HTML 标签并返回所有节点的纯文本 |
clean_ip_mapper | 删除 IP 地址 |
clean_links_mapper | 删除链接,例如以 http 或 ftp 开头的 |
clean_copyright_mapper | 删除代码文件开头的版权声明 (:warning: 必须包含单词 copyright) |
expand_macro_mapper | 扩展通常在 TeX 文档顶部定义的宏 |
fix_unicode_mapper | 修复损坏的 Unicode(借助 ftfy) |
punctuation_normalization_mapper | 将各种 Unicode 标点符号标准化为其 ASCII 等效项 |
remove_repeat_sentences_mapper | 删除样本中的重复句子 |
remove_specific_chars_mapper | 删除样本中的特殊字符(用户自定义) |
whitespace_normalization_mapper | 将各类空格归一转换为英语空格 |
alphanumeric_filter | 保留字母数字比例在指定范围内的样本 |
average_line_length_filter | 保留平均行长度在指定范围内的样本 |
character_repetition_filter | 保留 char-level n-gram 重复比率在指定范围内的样本 |
maximum_line_length_filter | 保留最大行长度在指定范围内的样本 |
perplexity_filter | 保留困惑度低于指定阈值的样本 |
special_characters_filter | 保留 special-char 比率的在指定范围内的样本 |
text_length_filter | 保留总文本长度在指定范围内的样本 |
word_repetition_filter | 保留 word-level n-gram 重复比率在指定范围内的样本 |
document_simhash_deduplicator | 使用 SimHash 在文档级别对样本去重 |
- 选用的代码处理算子
算子 | 描述 |
---|---|
clean_copyright_mapper | 删除代码文件开头的版权声明 (:warning: 必须包含单词 copyright) |
clean_email_mapper | 删除邮箱信息 |
clean_links_mapper | 删除链接,例如以 http 或 ftp 开头的 |
fix_unicode_mapper | 修复损坏的 Unicode(借助 ftfy) |
punctuation_normalization_mapper | 将各种 Unicode 标点符号标准化为其 ASCII 等效项 |
alphanumeric_filter | 保留字母数字比例在指定范围内的样本 |
average_line_length_filter | 保留平均行长度在指定范围内的样本 |
character_repetition_filter | 保留 char-level n-gram 重复比率在指定范围内的样本 |
maximum_line_length_filter | 保留最大行长度在指定范围内的样本 |
text_length_filter | 保留总文本长度在指定范围内的样本 |
word_num_filter | 保留字数在指定范围内的样本 |
word_repetition_filter | 保留 word-level n-gram 重复比率在指定范围内的样本 |
document_simhash_deduplicator | 使用 SimHash 在文档级别对样本去重 |
需要先在代码中修改filename_sets,指定数据路径:
python pretrain_modify_from_TinyLlama/scripts/prepare_steel_llm_data.py
输入数据格式为:包含'text'字段的jsonl文件
不单独训练tokenizer,使用Qwen/Qwen1.5-MoE-A2.7B-Chat的tokenizer
基于Qwen1.5模型,进行了如下改动:
- FFN层使用softmax moe,相同参数量下有更高的训练速度
- 使用双层的SwiGLU
基于TinyLlama预训练程序进行如下改进:
- 兼容HuggingFace格式的模型
- 加载checkpoint时,完全恢复数据训练的进度
- 数据一致性检测
- 在不影响已训练数据的情况下,在数据集中追加新的数据
启动预训练:
python Steel-LLM/pretrain_modify_from_TinyLlama/pretrain/pretrain_steel_llm.py
GPU:8* H800 80G
GPU:8* A100 80G
硬盘:4TB
欢迎加入交流群,人数已超过200,添加微信入群:a1843450905
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for Steel-LLM
Similar Open Source Tools
Steel-LLM
Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.
TensorRT-LLM
TensorRT-LLM is an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM contains components to create Python and C++ runtimes that execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server; a production-quality system to serve LLMs. Models built with TensorRT-LLM can be executed on a wide range of configurations going from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
VideoLLaMA2
VideoLLaMA 2 is a project focused on advancing spatial-temporal modeling and audio understanding in video-LLMs. It provides tools for multi-choice video QA, open-ended video QA, and video captioning. The project offers model zoo with different configurations for visual encoder and language decoder. It includes training and evaluation guides, as well as inference capabilities for video and image processing. The project also features a demo setup for running a video-based Large Language Model web demonstration.
ST-LLM
ST-LLM is a temporal-sensitive video large language model that incorporates joint spatial-temporal modeling, dynamic masking strategy, and global-local input module for effective video understanding. It has achieved state-of-the-art results on various video benchmarks. The repository provides code and weights for the model, along with demo scripts for easy usage. Users can train, validate, and use the model for tasks like video description, action identification, and reasoning.
aidea
AIdea is an app that integrates mainstream large language models and drawing models, developed using Flutter. The code is completely open-source and supports various functions such as GPT-3.5, GPT-4 from OpenAI, Claude instant, Claude 2.1 from Anthropic, Gemini Pro and visual language models from Google, as well as various Chinese and open-source models. It also supports features like text-to-image, super-resolution, coloring black and white images, artistic fonts, artistic QR codes, and more.
Bert-VITS2
Bert-VITS2 is a repository that provides a backbone with multilingual BERT for text-to-speech (TTS) applications. It offers an alternative to BV2/GSV projects and is inspired by the MassTTS project. Users can refer to the code to learn how to train models for TTS. The project is not maintained actively in the short term. It is not to be used for any purposes that violate the laws of the People's Republic of China, and strictly prohibits any political-related use.
AIBotPublic
AIBotPublic is an open-source version of AIBotPro, a comprehensive AI tool that provides various features such as knowledge base construction, AI drawing, API hosting, and more. It supports custom plugins and parallel processing of multiple files. The tool is built using bootstrap4 for the frontend, .NET6.0 for the backend, and utilizes technologies like SqlServer, Redis, and Milvus for database and vector database functionalities. It integrates third-party dependencies like Baidu AI OCR, Milvus C# SDK, Google Search, and more to enhance its capabilities.
ruoyi-ai
ruoyi-ai is a platform built on top of ruoyi-plus to implement AI chat and drawing functionalities on the backend. The project is completely open source and free. The backend management interface uses elementUI, while the server side is built using Java 17 and SpringBoot 3.X. It supports various AI models such as ChatGPT4, Dall-E-3, ChatGPT-4-All, voice cloning based on GPT-SoVITS, GPTS, and MidJourney. Additionally, it supports WeChat mini programs, personal QR code real-time payments, monitoring and AI auto-reply in live streaming rooms like Douyu and Bilibili, and personal WeChat integration with ChatGPT. The platform also includes features like private knowledge base management and provides various demo interfaces for different platforms such as mobile, web, and PC.
TEN-Agent
TEN Agent is an open-source multimodal agent powered by the world’s first real-time multimodal framework, TEN Framework. It offers high-performance real-time multimodal interactions, multi-language and multi-platform support, edge-cloud integration, flexibility beyond model limitations, and real-time agent state management. Users can easily build complex AI applications through drag-and-drop programming, integrating audio-visual tools, databases, RAG, and more.
UMOE-Scaling-Unified-Multimodal-LLMs
Uni-MoE is a MoE-based unified multimodal model that can handle diverse modalities including audio, speech, image, text, and video. The project focuses on scaling Unified Multimodal LLMs with a Mixture of Experts framework. It offers enhanced functionality for training across multiple nodes and GPUs, as well as parallel processing at both the expert and modality levels. The model architecture involves three training stages: building connectors for multimodal understanding, developing modality-specific experts, and incorporating multiple trained experts into LLMs using the LoRA technique on mixed multimodal data. The tool provides instructions for installation, weights organization, inference, training, and evaluation on various datasets.
Open-Sora-Plan
Open-Sora-Plan is a project that aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). The project is still in its early stages, but the team is working hard to improve it and make it more accessible to the open-source community. The project is currently focused on training an unconditional model on a landscape dataset, but the team plans to expand the scope of the project in the future to include text2video experiments, training on video2text datasets, and controlling the model with more conditions.
cool-admin-java
Cool-admin-java is an open-source backend permission management system with features like Ai coding, flow arrangement, modularity, and plugin support. It is used to quickly build backend applications. The system offers a modern development experience by providing functionalities such as one-click generation of API interfaces to frontend pages, drag-and-drop flow arrangement, modularized code for easy maintenance, and extensibility through plugin installation for features like payments, SMS, and emails.
auto-round
AutoRound is an advanced weight-only quantization algorithm for low-bits LLM inference. It competes impressively against recent methods without introducing any additional inference overhead. The method adopts sign gradient descent to fine-tune rounding values and minmax values of weights in just 200 steps, often significantly outperforming SignRound with the cost of more tuning time for quantization. AutoRound is tailored for a wide range of models and consistently delivers noticeable improvements.
Train-llm-from-scratch
Train-llm-from-scratch is a repository that guides users through training a Large Language Model (LLM) from scratch. The model size can be adjusted based on available computing power. The repository utilizes deepspeed for distributed training and includes detailed explanations of the code and key steps at each stage to facilitate learning. Users can train their own tokenizer or use pre-trained tokenizers like ChatGLM2-6B. The repository provides information on preparing pre-training data, processing training data, and recommended SFT data for fine-tuning. It also references other projects and books related to LLM training.
LLaVA-pp
This repository, LLaVA++, extends the visual capabilities of the LLaVA 1.5 model by incorporating the latest LLMs, Phi-3 Mini Instruct 3.8B, and LLaMA-3 Instruct 8B. It provides various models for instruction-following LMMS and academic-task-oriented datasets, along with training scripts for Phi-3-V and LLaMA-3-V. The repository also includes installation instructions and acknowledgments to related open-source contributions.
ComfyUI-BRIA_AI-RMBG
ComfyUI-BRIA_AI-RMBG is an unofficial implementation of the BRIA Background Removal v1.4 model for ComfyUI. The tool supports batch processing, including video background removal, and introduces a new mask output feature. Users can install the tool using ComfyUI Manager or manually by cloning the repository. The tool includes nodes for automatically loading the Removal v1.4 model and removing backgrounds. Updates include support for batch processing and the addition of a mask output feature.
For similar tasks
Steel-LLM
Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.
vllm
vLLM is a fast and easy-to-use library for LLM inference and serving. It is designed to be efficient, flexible, and easy to use. vLLM can be used to serve a variety of LLM models, including Hugging Face models. It supports a variety of decoding algorithms, including parallel sampling, beam search, and more. vLLM also supports tensor parallelism for distributed inference and streaming outputs. It is open-source and available on GitHub.
bce-qianfan-sdk
The Qianfan SDK provides best practices for large model toolchains, allowing AI workflows and AI-native applications to access the Qianfan large model platform elegantly and conveniently. The core capabilities of the SDK include three parts: large model reasoning, large model training, and general and extension: * `Large model reasoning`: Implements interface encapsulation for reasoning of Yuyan (ERNIE-Bot) series, open source large models, etc., supporting dialogue, completion, Embedding, etc. * `Large model training`: Based on platform capabilities, it supports end-to-end large model training process, including training data, fine-tuning/pre-training, and model services. * `General and extension`: General capabilities include common AI development tools such as Prompt/Debug/Client. The extension capability is based on the characteristics of Qianfan to adapt to common middleware frameworks.
dstack
Dstack is an open-source orchestration engine for running AI workloads in any cloud. It supports a wide range of cloud providers (such as AWS, GCP, Azure, Lambda, TensorDock, Vast.ai, CUDO, RunPod, etc.) as well as on-premises infrastructure. With Dstack, you can easily set up and manage dev environments, tasks, services, and pools for your AI workloads.
RVC_CLI
**RVC_CLI: Retrieval-based Voice Conversion Command Line Interface** This command-line interface (CLI) provides a comprehensive set of tools for voice conversion, enabling you to modify the pitch, timbre, and other characteristics of audio recordings. It leverages advanced machine learning models to achieve realistic and high-quality voice conversions. **Key Features:** * **Inference:** Convert the pitch and timbre of audio in real-time or process audio files in batch mode. * **TTS Inference:** Synthesize speech from text using a variety of voices and apply voice conversion techniques. * **Training:** Train custom voice conversion models to meet specific requirements. * **Model Management:** Extract, blend, and analyze models to fine-tune and optimize performance. * **Audio Analysis:** Inspect audio files to gain insights into their characteristics. * **API:** Integrate the CLI's functionality into your own applications or workflows. **Applications:** The RVC_CLI finds applications in various domains, including: * **Music Production:** Create unique vocal effects, harmonies, and backing vocals. * **Voiceovers:** Generate voiceovers with different accents, emotions, and styles. * **Audio Editing:** Enhance or modify audio recordings for podcasts, audiobooks, and other content. * **Research and Development:** Explore and advance the field of voice conversion technology. **For Jobs:** * Audio Engineer * Music Producer * Voiceover Artist * Audio Editor * Machine Learning Engineer **AI Keywords:** * Voice Conversion * Pitch Shifting * Timbre Modification * Machine Learning * Audio Processing **For Tasks:** * Convert Pitch * Change Timbre * Synthesize Speech * Train Model * Analyze Audio
llm-finetuning
llm-finetuning is a repository that provides a serverless twist to the popular axolotl fine-tuning library using Modal's serverless infrastructure. It allows users to quickly fine-tune any LLM model with state-of-the-art optimizations like Deepspeed ZeRO, LoRA adapters, Flash attention, and Gradient checkpointing. The repository simplifies the fine-tuning process by not exposing all CLI arguments, instead allowing users to specify options in a config file. It supports efficient training and scaling across multiple GPUs, making it suitable for production-ready fine-tuning jobs.
zeta
Zeta is a tool designed to build state-of-the-art AI models faster by providing modular, high-performance, and scalable building blocks. It addresses the common issues faced while working with neural nets, such as chaotic codebases, lack of modularity, and low performance modules. Zeta emphasizes usability, modularity, and performance, and is currently used in hundreds of models across various GitHub repositories. It enables users to prototype, train, optimize, and deploy the latest SOTA neural nets into production. The tool offers various modules like FlashAttention, SwiGLUStacked, RelativePositionBias, FeedForward, BitLinear, PalmE, Unet, VisionEmbeddings, niva, FusedDenseGELUDense, FusedDropoutLayerNorm, MambaBlock, Film, hyper_optimize, DPO, and ZetaCloud for different tasks in AI model development.
llm_qlora
LLM_QLoRA is a repository for fine-tuning Large Language Models (LLMs) using QLoRA methodology. It provides scripts for training LLMs on custom datasets, pushing models to HuggingFace Hub, and performing inference. Additionally, it includes models trained on HuggingFace Hub, a blog post detailing the QLoRA fine-tuning process, and instructions for converting and quantizing models. The repository also addresses troubleshooting issues related to Python versions and dependencies.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.