LLMBook-zh.github.io

《大语言模型》 (Large Language Models), by 赵鑫, 李军毅, 周昆, 唐天一, 文继荣

Stars: 242


This book aims to provide readers with a comprehensive understanding of large language model technology, including its basic principles, key technologies, and application prospects. Through in-depth research and practice, we can continuously explore and improve large language model technology, and contribute to the development of the field of artificial intelligence.

README:

Large Language Models (大语言模型)

Authors: 赵鑫, 李军毅, 周昆, 唐天一, 文继荣

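About This Book

At the end of 2022, ChatGPT made a stunning debut, and large language model technology quickly swept across society, marking an important advance for artificial intelligence. Faced with the powerful performance of large language models, we cannot help but ask: what technology actually underpins these models? This question has undoubtedly become a focus for many researchers. It must be pointed out that large model technology was not achieved overnight: its development passed through several stages, including statistical language models, neural network language models, and pre-trained language models, and each step embodies the effort and achievements of many researchers. As an important driver of large language model technology, OpenAI explored a great many related technical details in depth and eventually released the GPT series of models, leading this technological shift.

However, since GPT-3, the OpenAI team has rarely mentioned technical details in public materials, and many of its technical reports mainly cover evaluation. To date, the core technology of the GPT series remains hard to fully decipher. A major challenge for academia is that very few teams have sufficient resources to thoroughly explore large language model training, which leads to a shortage of first-hand experience and makes it hard to carry out related research directly. Large model training involves many details that often cannot be obtained directly from existing research papers. Because these models have many parameters, complex components, and a fairly complex training process, early experimental exploration without any prior knowledge may lead to an exponentially growing number of experiments. This makes it especially hard to gain experience with large model technology, let alone explore related research questions from scratch, and it has greatly limited the role academia plays in the current AI wave. At present, the more capable large language models almost all come from industry, and this trend may become more pronounced over time. Gaining "know-how" from first-hand experience is very important for researchers: only by getting in touch with the core of the technology can one truly understand which problems are meaningful and find solutions.

Encouragingly, both academia and industry have gradually recognized the importance of "openness," and a growing number of publicly available foundation models, code, and academic papers have strongly promoted the "transparency" of large model technology. Only through openness and sharing can we pool the wisdom of all humanity and jointly advance artificial intelligence. In fact, based on publicly available material, large model technology does follow established patterns, such as the overall training pipeline, data cleaning methods, instruction tuning techniques, and human preference alignment algorithms. With these techniques and sufficient computing resources, developers can already complete the full training pipeline of a large model fairly smoothly and obtain good model performance. As more core techniques are revealed and opened up, the transparency of large model technology will increase further.

In short, large model technology is developing rapidly: its basic principles urgently need exploration and its key techniques urgently need improvement. For researchers, work on large models is full of room for imagination and is deeply appealing. As the technology continues to advance and to be shared openly, we have reason to believe that artificial intelligence will make even greater progress and have a more far-reaching impact in more fields. This book aims to give readers a comprehensive understanding of large model technology, including its basic principles, key techniques, and application prospects. Through in-depth research and practice, we can keep exploring and improving large model technology and contribute to the development of artificial intelligence. We hope that by reading this book, readers will gain a deep understanding of the current state and future trends of large model technology, and find guidance and inspiration for their own research and practice. Let us work together to advance artificial intelligence and help build a smarter, more sustainable future.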

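Book Contents

Download

Full book: 《大语言模型》 (updated 2024-04-15)

At the end of December 2023, to provide a better Chinese-language reference on large model technology, we began writing the Chinese book and recently completed the first draft. The Chinese book focuses on explaining the material to newcomers and tries to present an overall framework and roadmap of large model technology. It is suitable for senior undergraduates and junior graduate students with a background in deep learning, and can serve as an introductory technical book.
Download link 1, download link 2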

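English survey paper: LLMSurvey

To better organize and disseminate the latest progress and technical landscape of large model technology, we published the English survey "A Survey of Large Language Models" in March 2023 and have kept updating it. The survey is now in its 13th version, with 83 pages of main text and more than 900 references.
After the English survey went online, readers repeatedly asked whether a Chinese version was available. In response, we released a Chinese translation of the survey (v10) in August 2023.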

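Companion Resources

LLMBox: code repository

LLMBox is a comprehensive code library for developing and implementing large language models, built on a unified training pipeline and a comprehensive model evaluation framework. It aims to be a one-stop solution for training and utilizing large language models, and it integrates many practical features that provide a high degree of flexibility and efficiency in both the training and utilization stages.

YuLan large language model: code repository

The YuLan series comprises chat-capable large language models jointly developed by faculty and students at the Gaoling School of Artificial Intelligence, Renmin University of China (the name "YuLan", 玉兰 or magnolia, comes from the university's campus flower). The latest version was pre-trained entirely from scratch and then fine-tuned with supervision on bilingual Chinese and English data using curriculum learning, including high-quality instructions and human preference data.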

Chapter Organization

Citation

赵鑫,李军毅,周昆,唐天一,文继荣，大语言模型，https://llmbook-zh.github.io/，2024.

@book{LLMBook,
  title = {大语言模型},
  year = {2024},
  author = {赵鑫 and 李军毅 and 周昆 and 唐天一 and 文继荣},
  address = {北京},
  url = {https://llmbook-zh.github.io/},
}

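Contributors

The leads and contributors for each chapter of this book are as follows:

Chapter 3: led by 闵映乾 and 杨晨, with contributions from 李军毅 and 周昆;
Chapter 4: led by 张君杰, 侯宇蓬, and 周昆;
Chapter 5: led by 董梓灿, with contributions from 田震 and 唐天一;
Chapter 6: led by 唐天一 and 陈昱硕;
Chapter 7: led by 唐天一, with contributions from 成晓雪;
Chapter 8: led by 李军毅 and 陈志朋;
Chapter 9: led by 陈昱硕, 刘沛羽, and 唐天一, with contributions from 周昆;
Chapter 10: led by 李军毅, 汤昕宇, and 都一凡, with contributions from 王晓磊;
Chapter 11: led by 任瑞阳 and 蒋锦昊, with contributions from 李军毅;
Chapter 12: led by 张北辰 and 周昆, with contributions from 张高玮;
Chapter 13: led by 周昆, with contributions from (in pinyin alphabetical order) 蒋锦昊, 李依凡, 刘子康, 孙文奇, 王禹淏, 徐澜玲, 杨锦霞, and 郑博文.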

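We also thank the other students who took part in writing and proofreading this book. They are (in pinyin alphabetical order) 曹乾, 曹展硕, 陈杰, 程伽雅琪, 戴孙浩, 邓欣, 丁毅杰, 冯雪扬, 高泽峰, 苟志斌, 辜子惠, 郭歌扬, 何东楠, 侯新铭, 胡译文, 李炳黔, 李成远, 李欣潼, 刘恩泽, 刘炯楠, 刘子涵, 罗文扬, 梅朗, 欧柯杉, 彭涵, 阮恺, 苏炜航, 孙一丁, 汤奕如, 王家鹏, 王磊, 王淑婷, 姚峰, 尹彦彬, 詹玉梁, 张景森, 张良, 朱天宇, and 朱余韬.

The writing of this book was supported by computing resources from the Large-Scale Scientific Instrument Sharing Platform of Renmin University of China, and we sincerely thank 陈跃国, 鲁蔚征, and 石源 for their help.

The cover image of this book was generated with the help of AI tools and was produced by 徐澜玲.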

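Feedback

While preparing the Chinese book, we read widely across existing classic papers, related code, and textbooks, distilled the core concepts and mainstream algorithms and models from them, and organized and presented them systematically. We revised the draft of every chapter multiple times, striving for clear and accurate expression. Nevertheless, during the writing we were keenly aware of the limits of our own ability and knowledge, and despite our best efforts there will inevitably be omissions or errors. The first edition of this book is only a starting point: we plan to keep updating and improving the content online, we especially welcome criticism and suggestions from readers, and we will acknowledge on the website the readers who offer valuable suggestions. We treat writing this book as a learning process for ourselves, and we hope to use it to exchange ideas in depth with readers and to learn from more peers in the field.

If you have any opinions, comments, or suggestions (please first check whether the issue has already been fixed in the latest version), please submit them through the GitHub Issues page or email the authors at batmanfly at qq.com, lijunyi at ruc.edu.cn, francis_kun_zhou at ruc.edu.cn, steventianyitang at outlook.com.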
