llms-from-scratch-cn

仅需Python基础，从0构建大语言模型；从0逐步构建GLM4\Llama3\RWKV6，深入理解大模型原理

Stars: 860

Visit

This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.

README:

动手学LLM

LLMs From Scratch: Hands-on Building Your Own Large Language Models

📘 项目介绍

如果你想从0手写代码，构建大语言模型，本项目很适合你。本项目 "LLMs From Scratch" 是由 Datawhale 提供的一个从头开始构建类似 ChatGPT 大型语言模型（LLM）的实践教程。我们旨在通过详细的指导、代码示例和深度学习资源，帮助开发者和研究者掌握创建大语言模型和大语言模型架构的核心技术。本项目包括了从0逐步构建GLM4\Llama3\RWKV6的教程，从0构建大模型，一起深入理解大模型原理。

🌟 项目亮点

全面的学习路径: 提供从基础理论到实际编码的系统化学习路径。
实践导向: 强调通过实际操作掌握 LLM 的开发和训练。
重点关注LLM架构: 在微调、部署相关教程较为丰富的背景下，我们着重关注大模型的架构实现。

🚀 主要内容

（1）基础知识

在基础知识部分，我们基于"rasbt/LLMs-from-scratch"提供了一个如何从头开始实现类似ChatGPT的大语言模型（LLM）的详细教程，特别感谢@rasbt。

如果你想快速入门，请参考Codes路径下的notebook，简洁的代码可以帮助你快速入门。

如果你想详细学习，请参考Translated_Book路径下的notebook，提供了更为详细的相关知识。

👨‍💻 代码实现: 该项目包含了创建GPT-like大语言模型的全部代码，涵盖了编码、预训练和微调过程。

📖 逐步学习: 教程通过清晰的文本、图表和示例，分步骤教授如何创建自己的LLM。

💡 教育目的: 该方法主要用于教育，帮助学习者训练和开发小型但功能性的模型，这与创建像ChatGPT这样的大型基础模型的方法相似。

🔧 简洁易懂的代码: 利用简洁且可运行的notebook代码，即使只有PyTorch基础，也能完成大模型的构建。

🤔 深入理解模型原理: 通过本教程，读者可以深入理解大型语言模型的工作原理。

📖 详细章节安排如下：

章节标题	主要代码	所有代码和补充
第1章: 理解大型语言模型	没有代码	没有代码
第2章: 处理文本数据	- ch02.ipynb - dataloader.ipynb - exercise-solutions.ipynb	./Codes/ch02
第3章: 编写注意力机制	- ch03.ipynb - multihead-attention.ipynb - exercise-solutions.ipynb	./Codes/ch03
第4章: 从零开始实现GPT模型	- ch04.ipynb - gpt.py - exercise-solutions.ipynb	./Codes/ch04
第5章: 使用未标记数据进行预训练	- ch05.ipynb - train.py - generate.py - exercise-solutions.ipynb	./Codes/ch05
第6章: 用于文本分类的微调	即将发布	即将发布
第7章: 使用人类反馈进行微调	即将发布	即将发布
第8章: 在实践中使用大型语言模型	即将发布	即将发布
附录A: PyTorch简介	- code-part1.ipynb - code-part2.ipynb - DDP-script.py - exercise-solutions.ipynb	appendix-A
附录B: 参考文献和进一步的阅读材料	没有代码	-
附录C: 练习	没有代码	-
附录D: 为训练过程添加额外的功能和特性	- appendix-D.ipynb	appendix-D

（2）模型架构的讨论和搭建

支持多种大型模型: 项目涵盖了 ChatGLM、Llama、RWKV 等多个大型模型的架构讨论与实现，详见 ./Model_Architecture_Discussions 目录。
架构详细解析: 包括每个模型的配置文件、训练脚本和核心代码，帮助学习者深入理解不同模型的内部机制。

模型类型	Notebook 笔记本	贡献者
ChatGLM3	chatglm3.ipynb	@Tangent-90C
Llama3	llama3.ipynb	@A10-research
RWKV V2	rwkv-v2.ipynb	@Ethan-Chen-plus
RWKV V3	rwkv-v3.ipynb	@Ethan-Chen-plus
RWKV V4	rwkv-v4.ipynb	@Ethan-Chen-plus
RWKV V5	rwkv-v5.ipynb	@Ethan-Chen-plus
RWKV V6	rwkv-v6.ipynb	@Ethan-Chen-plus
ChatGLM4	chatglm4.ipynb	@Ethan-Chen-plus

📅 Roadmap

注：规划未来任务，并通过 Issue 形式对外发布。

👫 参与贡献

如果你想参与到项目中，欢迎查看项目的 Issue 查看没有被分配的任务。
如果你发现了问题，请在 Issue 中进行反馈🐛。
如果你对本项目感兴趣，欢迎通过 Discussion 进行交流💬。

项目受众
- 技术背景：该项目适合有一定编程基础的人员，特别是对大型语言模型（LLM）感兴趣的开发者和研究者。
- 学习目标：适合那些希望深入了解LLM工作原理，并愿意投入时间从零开始构建和训练自己的LLM的学习者。
- 应用领域：适用于对自然语言处理、人工智能领域感兴趣的开发者，以及希望在教育或研究环境中应用LLM的人员。
项目亮点
- 系统化学习：该项目提供了一个系统化的学习路径，从理论基础到实际编码，帮助学习者全面理解LLM。
- 实践导向：与仅仅介绍理论或API使用不同，该项目强调实践，让学习者通过实际操作来掌握LLM的开发和训练。
- 深入浅出：该项目以清晰的语言、图表和示例来解释复杂的概念，使得非专业背景的学习者也能较好地理解。

如果你对 Datawhale 很感兴趣并想要发起一个新的项目，欢迎查看 Datawhale 贡献指南。

希望这个项目能够帮助你更好地理解和构建大型语言模型！ 🌐

贡献者名单（教程部分）

姓名	职责	简介	GitHub
陈可为	项目负责人	华中科技大学	@Ethan-Chen-plus
王训志	第2章贡献者	南开大学	@aJupyter
汪健麟	第2章贡献者
Aria	第2章贡献者		@ariafyy
汪健麟	第2章贡献者
张友东	第3章贡献者
邹雨衡	第3章贡献者
曹妍	第3章贡献者		@SamanthaTso
陈嘉诺	第4章贡献者	广州大学	@Tangent-90C
高立业	第4章贡献者
蒋文力	第4章贡献者		@morcake
丁悦	第5章贡献者	哈尔滨工业大学（威海）	@dingyue772
周景林	附录贡献者		@Beyondzjl
陈可为	附录贡献者		@Ethan-Chen-plus

关注我们

扫描下方二维码关注公众号：Datawhale

LICENSE

本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。

注：默认使用CC 4.0协议，也可根据自身项目情况选用其他协议

For Tasks:

Click tags to check more tools for each tasks

text generation text classification question answering machine translation chatbot development

For Jobs:

research scientist machine learning engineer data scientist natural language processing engineer artificial intelligence engineer

Alternative AI tools for llms-from-scratch-cn

Similar Open Source Tools

llms-from-scratch-cn

github

: 860

Awesome-LLMs-for-Video-Understanding

Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.

github

: 1.8k

MindChat

MindChat is a psychological large language model designed to help individuals relieve psychological stress and solve mental confusion, ultimately improving mental health. It aims to provide a relaxed and open conversation environment for users to build trust and understanding. MindChat offers privacy, warmth, safety, timely, and convenient conversation settings to help users overcome difficulties and challenges, achieve self-growth, and development. The tool is suitable for both work and personal life scenarios, providing comprehensive psychological support and therapeutic assistance to users while strictly protecting user privacy. It combines psychological knowledge with artificial intelligence technology to contribute to a healthier, more inclusive, and equal society.

github

: 436

pmhub

PmHub is a smart project management system based on SpringCloud, SpringCloud Alibaba, and LLM. It aims to help students quickly grasp the architecture design and development process of microservices/distributed projects. PmHub provides a platform for students to experience the transformation from monolithic to microservices architecture, understand the pros and cons of both architectures, and prepare for job interviews. It offers popular technologies like SpringCloud-Gateway, Nacos, Sentinel, and provides high-quality code, continuous integration, product design documents, and an enterprise workflow system. PmHub is suitable for beginners and advanced learners who want to master core knowledge of microservices/distributed projects.

github

: 280

stylellm_models

**stylellm** is a text style transfer project based on large language models (llms). The project utilizes large language models to learn the writing style of specific literary works (commonly used vocabulary, sentence structure, rhetoric, character dialogue, etc.), forming a series of specific style models. Using the **stylellm** model, the learned style can be transferred to other general texts, that is: input a piece of original text, the model can rewrite it, output text with the characteristics of that style, achieving the effect of text modification,润色or style imitation.

github

: 121

Chinese-LLaMA-Alpaca

This project open sources the **Chinese LLaMA model and the Alpaca large model fine-tuned with instructions**, to further promote the open research of large models in the Chinese NLP community. These models **extend the Chinese vocabulary based on the original LLaMA** and use Chinese data for secondary pre-training, further enhancing the basic Chinese semantic understanding ability. At the same time, the Chinese Alpaca model further uses Chinese instruction data for fine-tuning, significantly improving the model's understanding and execution of instructions.

github

: 17.2k

AstrBot

AstrBot is a powerful and versatile tool that leverages the capabilities of large language models (LLMs) like GPT-3, GPT-3.5, and GPT-4 to enhance communication and automate tasks. It seamlessly integrates with popular messaging platforms such as QQ, QQ Channel, and Telegram, enabling users to harness the power of AI within their daily conversations and workflows.

github

: 6.6k

Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on Large Language Models (LLMs). The repository contains papers, codes, datasets, evaluations, and analyses related to jailbreak attacks on LLMs. It serves as a comprehensive resource for researchers and practitioners interested in exploring various jailbreak techniques and defenses in the context of LLMs. Contributions such as additional jailbreak-related content, pull requests, and issue reports are welcome, and contributors are acknowledged. For any inquiries or issues, contact [email protected]. If you find this repository useful for your research or work, consider starring it to show appreciation.

github

: 507

AstrBot

github

: 7.0k

MedicalGPT

MedicalGPT is a training medical GPT model with ChatGPT training pipeline, implement of Pretraining, Supervised Finetuning, RLHF(Reward Modeling and Reinforcement Learning) and DPO(Direct Preference Optimization).

github

: 3.6k

Chinese-LLaMA-Alpaca-2

Chinese-LLaMA-Alpaca-2 is a large Chinese language model developed by Meta AI. It is based on the Llama-2 model and has been further trained on a large dataset of Chinese text. Chinese-LLaMA-Alpaca-2 can be used for a variety of natural language processing tasks, including text generation, question answering, and machine translation. Here are some of the key features of Chinese-LLaMA-Alpaca-2: * It is the largest Chinese language model ever trained, with 13 billion parameters. * It is trained on a massive dataset of Chinese text, including books, news articles, and social media posts. * It can be used for a variety of natural language processing tasks, including text generation, question answering, and machine translation. * It is open-source and available for anyone to use. Chinese-LLaMA-Alpaca-2 is a powerful tool that can be used to improve the performance of a wide range of natural language processing tasks. It is a valuable resource for researchers and developers working in the field of artificial intelligence.

github

: 6.8k

Chinese-LLaMA-Alpaca-3

Chinese-LLaMA-Alpaca-3 is a project based on Meta's latest release of the new generation open-source large model Llama-3. It is the third phase of the Chinese-LLaMA-Alpaca open-source large model series projects (Phase 1, Phase 2). This project open-sources the Chinese Llama-3 base model and the Chinese Llama-3-Instruct instruction fine-tuned large model. These models incrementally pre-train with a large amount of Chinese data on the basis of the original Llama-3 and further fine-tune using selected instruction data, enhancing Chinese basic semantics and instruction understanding capabilities. Compared to the second-generation related models, significant performance improvements have been achieved.

github

: 825

awesome-VLLMs

github

: 52

ai-app

The 'ai-app' repository is a comprehensive collection of tools and resources related to artificial intelligence, focusing on topics such as server environment setup, PyCharm and Anaconda installation, large model deployment and training, Transformer principles, RAG technology, vector databases, AI image, voice, and music generation, and AI Agent frameworks. It also includes practical guides and tutorials on implementing various AI applications. The repository serves as a valuable resource for individuals interested in exploring different aspects of AI technology.

github

: 103

Awesome-Knowledge-Distillation-of-LLMs

A collection of papers related to knowledge distillation of large language models (LLMs). The repository focuses on techniques to transfer advanced capabilities from proprietary LLMs to smaller models, compress open-source LLMs, and refine their performance. It covers various aspects of knowledge distillation, including algorithms, skill distillation, verticalization distillation in fields like law, medical & healthcare, finance, science, and miscellaneous domains. The repository provides a comprehensive overview of the research in the area of knowledge distillation of LLMs.

github

: 890

VoiceBench

VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.

github

: 119

For similar tasks

mlc-llm

MLC LLM is a high-performance universal deployment solution that allows native deployment of any large language models with native APIs with compiler acceleration. It supports a wide range of model architectures and variants, including Llama, GPT-NeoX, GPT-J, RWKV, MiniGPT, GPTBigCode, ChatGLM, StableLM, Mistral, and Phi. MLC LLM provides multiple sets of APIs across platforms and environments, including Python API, OpenAI-compatible Rest-API, C++ API, JavaScript API and Web LLM, Swift API for iOS App, and Java API and Android App.

github

: 20.2k

exllamav2

ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs. It is a faster, better, and more versatile codebase than its predecessor, ExLlamaV1, with support for a new quant format called EXL2. EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6, and 8-bit quantization. It allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. ExLlamaV2 can be installed from source, from a release with prebuilt extension, or from PyPI. It supports integration with TabbyAPI, ExUI, text-generation-webui, and lollms-webui. Key features of ExLlamaV2 include: - Faster and better kernels - Cleaner and more versatile codebase - Support for EXL2 quantization format - Integration with various web UIs and APIs - Community support on Discord

github

: 3.5k

Tutorial

The Bookworm·Puyu large model training camp aims to promote the implementation of large models in more industries and provide developers with a more efficient platform for learning the development and application of large models. Within two weeks, you will learn the entire process of fine-tuning, deploying, and evaluating large models.

github

: 1.6k

llama-api-server

This project aims to create a RESTful API server compatible with the OpenAI API using open-source backends like llama/llama2. With this project, various GPT tools/frameworks can be compatible with your own model. Key features include: - **Compatibility with OpenAI API**: The API server follows the OpenAI API structure, allowing seamless integration with existing tools and frameworks. - **Support for Multiple Backends**: The server supports both llama.cpp and pyllama backends, providing flexibility in model selection. - **Customization Options**: Users can configure model parameters such as temperature, top_p, and top_k to fine-tune the model's behavior. - **Batch Processing**: The API supports batch processing for embeddings, enabling efficient handling of multiple inputs. - **Token Authentication**: The server utilizes token authentication to secure access to the API. This tool is particularly useful for developers and researchers who want to integrate large language models into their applications or explore custom models without relying on proprietary APIs.

github

: 204

llms-from-scratch-cn

github

: 860

llm_interview_note

This repository provides a comprehensive overview of large language models (LLMs), covering various aspects such as their history, types, underlying architecture, training techniques, and applications. It includes detailed explanations of key concepts like Transformer models, distributed training, fine-tuning, and reinforcement learning. The repository also discusses the evaluation and limitations of LLMs, including the phenomenon of hallucinations. Additionally, it provides a list of related courses and references for further exploration.

github

: 2.1k

SeaLLMs

SeaLLMs are a family of language models optimized for Southeast Asian (SEA) languages. They were pre-trained from Llama-2, on a tailored publicly-available dataset, which comprises texts in Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲. The SeaLLM-chat underwent supervised finetuning (SFT) and specialized self-preferencing DPO using a mix of public instruction data and a small number of queries used by SEA language native speakers in natural settings, which **adapt to the local cultural norms, customs, styles and laws in these areas**. SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform **ChatGPT-3.5** in non-Latin languages, such as Thai, Khmer, Lao, and Burmese.

github

: 130

PhoGPT

PhoGPT is an open-source 4B-parameter generative model series for Vietnamese, including the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. PhoGPT-4B is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20K token types. PhoGPT-4B-Chat is fine-tuned on instructional prompts and conversations, demonstrating superior performance. Users can run the model with inference engines like vLLM and Text Generation Inference, and fine-tune it using llm-foundry. However, PhoGPT has limitations in reasoning, coding, and mathematics tasks, and may generate harmful or biased responses.

github

: 739

For similar jobs

LLM-FineTuning-Large-Language-Models

This repository contains projects and notes on common practical techniques for fine-tuning Large Language Models (LLMs). It includes fine-tuning LLM notebooks, Colab links, LLM techniques and utils, and other smaller language models. The repository also provides links to YouTube videos explaining the concepts and techniques discussed in the notebooks.

github

: 319

lloco

LLoCO is a technique that learns documents offline through context compression and in-domain parameter-efficient finetuning using LoRA, which enables LLMs to handle long context efficiently.

github

: 60

camel

CAMEL is an open-source library designed for the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we implement and support various types of agents, tasks, prompts, models, and simulated environments.

github

: 11.5k

llm-baselines

LLM-baselines is a modular codebase to experiment with transformers, inspired from NanoGPT. It provides a quick and easy way to train and evaluate transformer models on a variety of datasets. The codebase is well-documented and easy to use, making it a great resource for researchers and practitioners alike.

github

: 58

python-tutorial-notebooks

This repository contains Jupyter-based tutorials for NLP, ML, AI in Python for classes in Computational Linguistics, Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) at Indiana University.

github

: 121

EvalAI

EvalAI is an open-source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. It provides a central leaderboard and submission interface, making it easier for researchers to reproduce results mentioned in papers and perform reliable & accurate quantitative analysis. EvalAI also offers features such as custom evaluation protocols and phases, remote evaluation, evaluation inside environments, CLI support, portability, and faster evaluation.

github

: 1.8k

Weekly-Top-LLM-Papers

This repository provides a curated list of weekly published Large Language Model (LLM) papers. It includes top important LLM papers for each week, organized by month and year. The papers are categorized into different time periods, making it easy to find the most recent and relevant research in the field of LLM.

github

: 149

self-llm

This project is a Chinese tutorial for domestic beginners based on the AutoDL platform, providing full-process guidance for various open-source large models, including environment configuration, local deployment, and efficient fine-tuning. It simplifies the deployment, use, and application process of open-source large models, enabling more ordinary students and researchers to better use open-source large models and helping open and free large models integrate into the lives of ordinary learners faster.

github

: 14.0k