llms-from-scratch-cn
仅需Python基础,从0构建大语言模型;从0逐步构建GLM4\Llama3\RWKV6, 深入理解大模型原理
Stars: 860
This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.
README:
如果你想从0手写代码,构建大语言模型,本项目很适合你。 本项目 "LLMs From Scratch" 是由 Datawhale 提供的一个从头开始构建类似 ChatGPT 大型语言模型(LLM)的实践教程。 我们旨在通过详细的指导、代码示例和深度学习资源,帮助开发者和研究者掌握创建大语言模型和大语言模型架构的核心技术。 本项目包括了从0逐步构建GLM4\Llama3\RWKV6的教程,从0构建大模型,一起深入理解大模型原理。
- 全面的学习路径: 提供从基础理论到实际编码的系统化学习路径。
- 实践导向: 强调通过实际操作掌握 LLM 的开发和训练。
- 重点关注LLM架构: 在微调、部署相关教程较为丰富的背景下,我们着重关注大模型的架构实现。
在基础知识部分,我们基于"rasbt/LLMs-from-scratch"提供了一个如何从头开始实现类似ChatGPT的大语言模型(LLM)的详细教程,特别感谢@rasbt。
如果你想快速入门,请参考Codes路径下的notebook,简洁的代码可以帮助你快速入门。
如果你想详细学习,请参考Translated_Book路径下的notebook,提供了更为详细的相关知识。
👨💻 代码实现: 该项目包含了创建GPT-like大语言模型的全部代码,涵盖了编码、预训练和微调过程。
📖 逐步学习: 教程通过清晰的文本、图表和示例,分步骤教授如何创建自己的LLM。
💡 教育目的: 该方法主要用于教育,帮助学习者训练和开发小型但功能性的模型,这与创建像ChatGPT这样的大型基础模型的方法相似。
🔧 简洁易懂的代码: 利用简洁且可运行的notebook代码,即使只有PyTorch基础,也能完成大模型的构建。
🤔 深入理解模型原理: 通过本教程,读者可以深入理解大型语言模型的工作原理。
📖 详细章节安排如下:
章节标题 | 主要代码 | 所有代码和补充 |
---|---|---|
第1章: 理解大型语言模型 | 没有代码 | 没有代码 |
第2章: 处理文本数据 | - ch02.ipynb - dataloader.ipynb - exercise-solutions.ipynb |
./Codes/ch02 |
第3章: 编写注意力机制 | - ch03.ipynb - multihead-attention.ipynb - exercise-solutions.ipynb |
./Codes/ch03 |
第4章: 从零开始实现GPT模型 | - ch04.ipynb - gpt.py - exercise-solutions.ipynb |
./Codes/ch04 |
第5章: 使用未标记数据进行预训练 | - ch05.ipynb - train.py - generate.py - exercise-solutions.ipynb |
./Codes/ch05 |
第6章: 用于文本分类的微调 | 即将发布 | 即将发布 |
第7章: 使用人类反馈进行微调 | 即将发布 | 即将发布 |
第8章: 在实践中使用大型语言模型 | 即将发布 | 即将发布 |
附录A: PyTorch简介 | - code-part1.ipynb - code-part2.ipynb - DDP-script.py - exercise-solutions.ipynb |
appendix-A |
附录B: 参考文献和进一步的阅读材料 | 没有代码 | - |
附录C: 练习 | 没有代码 | - |
附录D: 为训练过程添加额外的功能和特性 | - appendix-D.ipynb | appendix-D |
-
支持多种大型模型: 项目涵盖了 ChatGLM、Llama、RWKV 等多个大型模型的架构讨论与实现,详见
./Model_Architecture_Discussions
目录。 - 架构详细解析: 包括每个模型的配置文件、训练脚本和核心代码,帮助学习者深入理解不同模型的内部机制。
模型类型 | Notebook 笔记本 | 贡献者 |
---|---|---|
ChatGLM3 | chatglm3.ipynb | @Tangent-90C |
Llama3 | llama3.ipynb | @A10-research |
RWKV V2 | rwkv-v2.ipynb | @Ethan-Chen-plus |
RWKV V3 | rwkv-v3.ipynb | @Ethan-Chen-plus |
RWKV V4 | rwkv-v4.ipynb | @Ethan-Chen-plus |
RWKV V5 | rwkv-v5.ipynb | @Ethan-Chen-plus |
RWKV V6 | rwkv-v6.ipynb | @Ethan-Chen-plus |
ChatGLM4 | chatglm4.ipynb | @Ethan-Chen-plus |
注:规划未来任务,并通过 Issue 形式对外发布。
- 如果你想参与到项目中,欢迎查看项目的 Issue 查看没有被分配的任务。
- 如果你发现了问题,请在 Issue 中进行反馈🐛。
- 如果你对本项目感兴趣,欢迎通过 Discussion 进行交流💬。
-
项目受众
-
- 技术背景:该项目适合有一定编程基础的人员,特别是对大型语言模型(LLM)感兴趣的开发者和研究者。
- 学习目标:适合那些希望深入了解LLM工作原理,并愿意投入时间从零开始构建和训练自己的LLM的学习者。
- 应用领域:适用于对自然语言处理、人工智能领域感兴趣的开发者,以及希望在教育或研究环境中应用LLM的人员。
-
项目亮点
- 系统化学习:该项目提供了一个系统化的学习路径,从理论基础到实际编码,帮助学习者全面理解LLM。
- 实践导向:与仅仅介绍理论或API使用不同,该项目强调实践,让学习者通过实际操作来掌握LLM的开发和训练。
- 深入浅出:该项目以清晰的语言、图表和示例来解释复杂的概念,使得非专业背景的学习者也能较好地理解。
如果你对 Datawhale 很感兴趣并想要发起一个新的项目,欢迎查看 Datawhale 贡献指南。
希望这个项目能够帮助你更好地理解和构建大型语言模型! 🌐
姓名 | 职责 | 简介 | GitHub |
---|---|---|---|
陈可为 | 项目负责人 | 华中科技大学 | @Ethan-Chen-plus |
王训志 | 第2章贡献者 | 南开大学 | @aJupyter |
汪健麟 | 第2章贡献者 | ||
Aria | 第2章贡献者 | @ariafyy | |
汪健麟 | 第2章贡献者 | ||
张友东 | 第3章贡献者 | ||
邹雨衡 | 第3章贡献者 | ||
曹 妍 | 第3章贡献者 | @SamanthaTso | |
陈嘉诺 | 第4章贡献者 | 广州大学 | @Tangent-90C |
高立业 | 第4章贡献者 | ||
蒋文力 | 第4章贡献者 | @morcake | |
丁悦 | 第5章贡献者 | 哈尔滨工业大学(威海) | @dingyue772 |
周景林 | 附录贡献者 | @Beyondzjl | |
陈可为 | 附录贡献者 | @Ethan-Chen-plus |
本作品采用知识共享署名-非商业性使用-相同方式共享 4.0 国际许可协议进行许可。
注:默认使用CC 4.0协议,也可根据自身项目情况选用其他协议
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for llms-from-scratch-cn
Similar Open Source Tools
llms-from-scratch-cn
This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.
Awesome-LLMs-for-Video-Understanding
Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.
MedicalGPT
MedicalGPT is a training medical GPT model with ChatGPT training pipeline, implement of Pretraining, Supervised Finetuning, RLHF(Reward Modeling and Reinforcement Learning) and DPO(Direct Preference Optimization).
MindChat
MindChat is a psychological large language model designed to help individuals relieve psychological stress and solve mental confusion, ultimately improving mental health. It aims to provide a relaxed and open conversation environment for users to build trust and understanding. MindChat offers privacy, warmth, safety, timely, and convenient conversation settings to help users overcome difficulties and challenges, achieve self-growth, and development. The tool is suitable for both work and personal life scenarios, providing comprehensive psychological support and therapeutic assistance to users while strictly protecting user privacy. It combines psychological knowledge with artificial intelligence technology to contribute to a healthier, more inclusive, and equal society.
pmhub
PmHub is a smart project management system based on SpringCloud, SpringCloud Alibaba, and LLM. It aims to help students quickly grasp the architecture design and development process of microservices/distributed projects. PmHub provides a platform for students to experience the transformation from monolithic to microservices architecture, understand the pros and cons of both architectures, and prepare for job interviews. It offers popular technologies like SpringCloud-Gateway, Nacos, Sentinel, and provides high-quality code, continuous integration, product design documents, and an enterprise workflow system. PmHub is suitable for beginners and advanced learners who want to master core knowledge of microservices/distributed projects.
gpupixel
GPUPixel is a real-time, high-performance image and video filter library written in C++11 and based on OpenGL/ES. It incorporates a built-in beauty face filter that achieves commercial-grade beauty effects. The library is extremely easy to compile and integrate with a small size, supporting platforms including iOS, Android, Mac, Windows, and Linux. GPUPixel provides various filters like skin smoothing, whitening, face slimming, big eyes, lipstick, and blush. It supports input formats like YUV420P, RGBA, JPEG, PNG, and output formats like RGBA and YUV420P. The library's performance on devices like iPhone and Android is optimized, with low CPU usage and fast processing times. GPUPixel's lib size is compact, making it suitable for mobile and desktop applications.
stylellm_models
**stylellm** is a text style transfer project based on large language models (llms). The project utilizes large language models to learn the writing style of specific literary works (commonly used vocabulary, sentence structure, rhetoric, character dialogue, etc.), forming a series of specific style models. Using the **stylellm** model, the learned style can be transferred to other general texts, that is: input a piece of original text, the model can rewrite it, output text with the characteristics of that style, achieving the effect of text modification,润色or style imitation.
Chinese-LLaMA-Alpaca
This project open sources the **Chinese LLaMA model and the Alpaca large model fine-tuned with instructions**, to further promote the open research of large models in the Chinese NLP community. These models **extend the Chinese vocabulary based on the original LLaMA** and use Chinese data for secondary pre-training, further enhancing the basic Chinese semantic understanding ability. At the same time, the Chinese Alpaca model further uses Chinese instruction data for fine-tuning, significantly improving the model's understanding and execution of instructions.
Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on Large Language Models (LLMs). The repository contains papers, codes, datasets, evaluations, and analyses related to jailbreak attacks on LLMs. It serves as a comprehensive resource for researchers and practitioners interested in exploring various jailbreak techniques and defenses in the context of LLMs. Contributions such as additional jailbreak-related content, pull requests, and issue reports are welcome, and contributors are acknowledged. For any inquiries or issues, contact [email protected]. If you find this repository useful for your research or work, consider starring it to show appreciation.
Chinese-LLaMA-Alpaca-2
Chinese-LLaMA-Alpaca-2 is a large Chinese language model developed by Meta AI. It is based on the Llama-2 model and has been further trained on a large dataset of Chinese text. Chinese-LLaMA-Alpaca-2 can be used for a variety of natural language processing tasks, including text generation, question answering, and machine translation. Here are some of the key features of Chinese-LLaMA-Alpaca-2: * It is the largest Chinese language model ever trained, with 13 billion parameters. * It is trained on a massive dataset of Chinese text, including books, news articles, and social media posts. * It can be used for a variety of natural language processing tasks, including text generation, question answering, and machine translation. * It is open-source and available for anyone to use. Chinese-LLaMA-Alpaca-2 is a powerful tool that can be used to improve the performance of a wide range of natural language processing tasks. It is a valuable resource for researchers and developers working in the field of artificial intelligence.
Chinese-LLaMA-Alpaca-3
Chinese-LLaMA-Alpaca-3 is a project based on Meta's latest release of the new generation open-source large model Llama-3. It is the third phase of the Chinese-LLaMA-Alpaca open-source large model series projects (Phase 1, Phase 2). This project open-sources the Chinese Llama-3 base model and the Chinese Llama-3-Instruct instruction fine-tuned large model. These models incrementally pre-train with a large amount of Chinese data on the basis of the original Llama-3 and further fine-tune using selected instruction data, enhancing Chinese basic semantics and instruction understanding capabilities. Compared to the second-generation related models, significant performance improvements have been achieved.
ai-app
The 'ai-app' repository is a comprehensive collection of tools and resources related to artificial intelligence, focusing on topics such as server environment setup, PyCharm and Anaconda installation, large model deployment and training, Transformer principles, RAG technology, vector databases, AI image, voice, and music generation, and AI Agent frameworks. It also includes practical guides and tutorials on implementing various AI applications. The repository serves as a valuable resource for individuals interested in exploring different aspects of AI technology.
Awesome-Knowledge-Distillation-of-LLMs
A collection of papers related to knowledge distillation of large language models (LLMs). The repository focuses on techniques to transfer advanced capabilities from proprietary LLMs to smaller models, compress open-source LLMs, and refine their performance. It covers various aspects of knowledge distillation, including algorithms, skill distillation, verticalization distillation in fields like law, medical & healthcare, finance, science, and miscellaneous domains. The repository provides a comprehensive overview of the research in the area of knowledge distillation of LLMs.
aidea-server
AIdea Server is an open-source Golang-based server that integrates mainstream large language models and drawing models. It supports various functionalities including OpenAI's GPT-3.5 and GPT-4, Anthropic's Claude instant and Claude 2.1, Google's Gemini Pro, as well as Chinese models like Tongyi Qianwen, Wenxin Yiyuan, and more. It also supports open-source large models like Yi 34B, Llama2, and AquilaChat 7B. Additionally, it provides features for text-to-image, super-resolution, coloring black and white images, generating art fonts and QR codes, among others.
Tiktoken
Tiktoken is a high-performance implementation focused on token count operations. It provides various encodings like o200k_base, cl100k_base, r50k_base, p50k_base, and p50k_edit. Users can easily encode and decode text using the provided API. The repository also includes a benchmark console app for performance tracking. Contributions in the form of PRs are welcome.
DISC-LawLLM
DISC-LawLLM is a legal domain large model that aims to provide professional, intelligent, and comprehensive **legal services** to users. It is developed and open-sourced by the Data Intelligence and Social Computing Lab (Fudan-DISC) at Fudan University.
For similar tasks
mlc-llm
MLC LLM is a high-performance universal deployment solution that allows native deployment of any large language models with native APIs with compiler acceleration. It supports a wide range of model architectures and variants, including Llama, GPT-NeoX, GPT-J, RWKV, MiniGPT, GPTBigCode, ChatGLM, StableLM, Mistral, and Phi. MLC LLM provides multiple sets of APIs across platforms and environments, including Python API, OpenAI-compatible Rest-API, C++ API, JavaScript API and Web LLM, Swift API for iOS App, and Java API and Android App.
exllamav2
ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs. It is a faster, better, and more versatile codebase than its predecessor, ExLlamaV1, with support for a new quant format called EXL2. EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6, and 8-bit quantization. It allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. ExLlamaV2 can be installed from source, from a release with prebuilt extension, or from PyPI. It supports integration with TabbyAPI, ExUI, text-generation-webui, and lollms-webui. Key features of ExLlamaV2 include: - Faster and better kernels - Cleaner and more versatile codebase - Support for EXL2 quantization format - Integration with various web UIs and APIs - Community support on Discord
Tutorial
The Bookworm·Puyu large model training camp aims to promote the implementation of large models in more industries and provide developers with a more efficient platform for learning the development and application of large models. Within two weeks, you will learn the entire process of fine-tuning, deploying, and evaluating large models.
llama-api-server
This project aims to create a RESTful API server compatible with the OpenAI API using open-source backends like llama/llama2. With this project, various GPT tools/frameworks can be compatible with your own model. Key features include: - **Compatibility with OpenAI API**: The API server follows the OpenAI API structure, allowing seamless integration with existing tools and frameworks. - **Support for Multiple Backends**: The server supports both llama.cpp and pyllama backends, providing flexibility in model selection. - **Customization Options**: Users can configure model parameters such as temperature, top_p, and top_k to fine-tune the model's behavior. - **Batch Processing**: The API supports batch processing for embeddings, enabling efficient handling of multiple inputs. - **Token Authentication**: The server utilizes token authentication to secure access to the API. This tool is particularly useful for developers and researchers who want to integrate large language models into their applications or explore custom models without relying on proprietary APIs.
llms-from-scratch-cn
This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.
llm_interview_note
This repository provides a comprehensive overview of large language models (LLMs), covering various aspects such as their history, types, underlying architecture, training techniques, and applications. It includes detailed explanations of key concepts like Transformer models, distributed training, fine-tuning, and reinforcement learning. The repository also discusses the evaluation and limitations of LLMs, including the phenomenon of hallucinations. Additionally, it provides a list of related courses and references for further exploration.
SeaLLMs
SeaLLMs are a family of language models optimized for Southeast Asian (SEA) languages. They were pre-trained from Llama-2, on a tailored publicly-available dataset, which comprises texts in Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲. The SeaLLM-chat underwent supervised finetuning (SFT) and specialized self-preferencing DPO using a mix of public instruction data and a small number of queries used by SEA language native speakers in natural settings, which **adapt to the local cultural norms, customs, styles and laws in these areas**. SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform **ChatGPT-3.5** in non-Latin languages, such as Thai, Khmer, Lao, and Burmese.
PhoGPT
PhoGPT is an open-source 4B-parameter generative model series for Vietnamese, including the base pre-trained monolingual model PhoGPT-4B and its chat variant, PhoGPT-4B-Chat. PhoGPT-4B is pre-trained from scratch on a Vietnamese corpus of 102B tokens, with an 8192 context length and a vocabulary of 20K token types. PhoGPT-4B-Chat is fine-tuned on instructional prompts and conversations, demonstrating superior performance. Users can run the model with inference engines like vLLM and Text Generation Inference, and fine-tune it using llm-foundry. However, PhoGPT has limitations in reasoning, coding, and mathematics tasks, and may generate harmful or biased responses.
For similar jobs
LLM-FineTuning-Large-Language-Models
This repository contains projects and notes on common practical techniques for fine-tuning Large Language Models (LLMs). It includes fine-tuning LLM notebooks, Colab links, LLM techniques and utils, and other smaller language models. The repository also provides links to YouTube videos explaining the concepts and techniques discussed in the notebooks.
lloco
LLoCO is a technique that learns documents offline through context compression and in-domain parameter-efficient finetuning using LoRA, which enables LLMs to handle long context efficiently.
camel
CAMEL is an open-source library designed for the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we implement and support various types of agents, tasks, prompts, models, and simulated environments.
llm-baselines
LLM-baselines is a modular codebase to experiment with transformers, inspired from NanoGPT. It provides a quick and easy way to train and evaluate transformer models on a variety of datasets. The codebase is well-documented and easy to use, making it a great resource for researchers and practitioners alike.
python-tutorial-notebooks
This repository contains Jupyter-based tutorials for NLP, ML, AI in Python for classes in Computational Linguistics, Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) at Indiana University.
EvalAI
EvalAI is an open-source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale. It provides a central leaderboard and submission interface, making it easier for researchers to reproduce results mentioned in papers and perform reliable & accurate quantitative analysis. EvalAI also offers features such as custom evaluation protocols and phases, remote evaluation, evaluation inside environments, CLI support, portability, and faster evaluation.
Weekly-Top-LLM-Papers
This repository provides a curated list of weekly published Large Language Model (LLM) papers. It includes top important LLM papers for each week, organized by month and year. The papers are categorized into different time periods, making it easy to find the most recent and relevant research in the field of LLM.
self-llm
This project is a Chinese tutorial for domestic beginners based on the AutoDL platform, providing full-process guidance for various open-source large models, including environment configuration, local deployment, and efficient fine-tuning. It simplifies the deployment, use, and application process of open-source large models, enabling more ordinary students and researchers to better use open-source large models and helping open and free large models integrate into the lives of ordinary learners faster.