gpt_server
gpt_server is an open-source framework for production-grade deployment of LLMs or Embedding models.
The GPT Server project builds on the core capabilities of FastChat to provide an OpenAI-compatible server. It adapts many additional models, fixes models that FastChat supports poorly, and can load models through vLLM, LMDeploy, or HF. It also supports all sentence_transformers-compatible semantic vector models, chat templates without role restrictions, Function Calling (Tools), and multi-modal large models. The project aims to reduce the difficulty of model adaptation and project usage, making it easy to deploy the latest models with minimal code changes.
README:
This project builds on the core capabilities of FastChat to provide an OpenAI-compatible server.
- On that foundation, it adapts many more models and fixes models that FastChat supports poorly
- Function Calling (Tools) support (currently Qwen/ChatGLM, with better support for Qwen)
- Re-adapts models that vLLM handles poorly, fixing cases where decoded output diverged from HF
- Supports the vLLM, LMDeploy, and HF loading backends
- Supports all sentence_transformers-compatible semantic vector models (Embedding and Reranker)
- Infinity backend support: faster inference than onnx/tensorrt, with dynamic batching
- Chat templates carry no role restrictions, so the LangGraph Agent framework is fully supported
- Multi-modal large model support
- Lowers the difficulty of adapting models and of using the project (adapting a new model takes fewer than 5 changed lines of code), making it easier to deploy your own latest models.

(The repository is still in early construction and has not gone through thorough regression testing, so already-adapted models may break; suggestions for improvements or for adapting new models are welcome.)
- Multiple inference backend engines: vLLM and LMDeploy. The LMDeploy backend handles 1.36x to 1.85x as many requests per second as vLLM.
- Infinity backend: faster inference than onnx/tensorrt, with dynamic batching
- The only open-source framework worldwide with complete Tools (Function Calling) support, compatible with LangChain's bind_tools, AgentExecutor, and with_structured_output usage (currently the Qwen and GLM series); a client-side sketch follows this list
- The only project worldwide that extends the openai library to serve Reranker models (see gpt_server/tests/test_openai_rerank.py for a code sample)
- Multi-modal large model support
- The same distributed architecture as FastChat
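To make the Tools support concrete, here is a minimal client-side sketch (not taken from the repository) that sends a tool definition to the OpenAI-compatible endpoint. Port 8082 and the model name `qwen` are assumptions matching the sample config.yaml later in this README, and `get_weather` is a hypothetical tool:

```python
# Hedged sketch: function calling against gpt_server's OpenAI-compatible API.
# The base_url/port and model name are taken from the sample config in this
# README; the get_weather tool is hypothetical, for illustration only.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Query the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)
# If the model decides to call the tool, the structured call appears here.
print(resp.choices[0].message.tool_calls)
```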
Update log:
- 2024-10-15 Added support for Qwen2-VL
- 2024-9-19 Added support for the minicpmv model
- 2024-8-17 Added LoRA deployment for the vllm/hf backends
- 2024-8-14 Added support for the InternVL2 series of multi-modal models
- 2024-7-28 Added dynamic batching acceleration for embedding/reranker (Infinity backend, faster than onnx/tensorrt)
- 2024-7-19 Added support for the multi-modal model glm-4v-9b on the LMDeploy PyTorch backend
- 2024-6-22 Added function call (tools) support for the Qwen and ChatGLM series
- 2024-6-12 Added support for qwen-2
- 2024-6-5 Added support for the Yinka and zpoint_large_embedding_zh embedding models
- 2024-6-5 Added support for the glm4-9b series (hf and vllm)
- 2024-4-27 Added the LMDeploy accelerated inference backend
- 2024-4-20 Added support for llama-3
- 2024-4-13 Added support for deepseek
- 2024-4-4 Added support for the embedding model acge_text_embedding
- 2024-3-9 Added support for reranker models (bge-reranker, bce-reranker-base_v1)
- 2024-3-3 Added support for internlm-1.0 and internlm-2.0
- 2024-3-2 Added support for qwen-1.5 (0.5B, 1.8B, 4B, 7B, 14B, and 72B)
- 2024-2-4 Added the vllm implementation
- 2024-1-6 Added support for Yi-34B
- 2023-12-31 Added support for qwen-7b and qwen-14b
- 2023-12-30 Added support for all-embedding (in principle, every text embedding model)
- 2023-12-24 Added support for chatglm3-6b
Roadmap:
- [X] HF backend support
- [X] vLLM backend support
- [X] LMDeploy backend support
- [X] Function call (tools) support (the Qwen and ChatGLM series are already supported; more will be added as demand arises)
- [X] Multi-modal model support (initially glm-4v; other models will follow over time)
- [X] Dynamic batching for Embedding models (implemented via the Infinity backend)
- [X] Dynamic batching for Reranker models (implemented via the Infinity backend)
- [ ] Visual launch UI
- [ ] Built-in tools (image_gen, code_interpreter, weather, etc.)
- [ ] Parallel function call (tools) support
```bash
# 1. Create the conda environment
conda create -n gpt_server python=3.10
# 2. Activate the conda environment
conda activate gpt_server
# 3. Install the repository (be sure to install via install.sh, otherwise the dependency conflicts cannot be resolved)
sh install.sh
```
To change the model backend (vllm, lmdeploy, etc.), set in config.yaml:

```yaml
work_mode: vllm # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
```

To change the embedding/reranker backend (embedding or embedding_infinity), set in config.yaml:

```yaml
model_type: embedding_infinity # embedding or embedding_infinity; the embedding_infinity backend is far faster than embedding
```
```bash
cd gpt_server/script
vim config.yaml
```
A sample config.yaml:

```yaml
serve_args: # host and port of the OpenAI-compatible service
  host: 0.0.0.0
  port: 8082
  controller_address: http://localhost:21001 # address of the controller
  # api_keys: 111,222 # used to set OpenAI API keys

# controller
controller_args: # controller configuration
  host: 0.0.0.0
  port: 21001
  dispatch_method: shortest_queue # lottery or shortest_queue; of the two dispatch strategies, random (lottery) and shortest queue (shortest_queue), shortest_queue is recommended

# model worker
model_worker_args: # model worker configuration; port cannot be set here, it is assigned automatically and registered with the controller
  host: 0.0.0.0
  controller_address: http://localhost:21001 # address at which the model registers with the controller

models:
  - chatglm4: # custom model name
      alias: null # aliases, e.g. gpt4,gpt3
      enable: true # false / true; controls whether this model worker starts
      model_name_or_path: /home/dev/model/THUDM/glm-4-9b-chat/
      model_type: chatglm # qwen / yi / internlm
      work_mode: vllm # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      # lora: # LoRA configuration
      #   test_lora: /home/dev/project/LLaMA-Factory/saves/Qwen1.5-14B-Chat/lora/train_2024-03-22-09-01-32/checkpoint-100
      device: gpu # gpu / cpu
      workers:
        - gpus:
            # - 1
            - 0
      # - gpus:  # GPUs [0,1] in one entry: the model is sharded with TP (tensor parallelism), the default
      #     - 0
      #     - 1
      # - gpus:  # two entries: two model replicas, replica 1 on GPU 0 and replica 2 on GPU 1
      #     - 0
      # - gpus:  # (see the sketch after this config block)
      #     - 1
  - qwen: # custom model name
      alias: gpt-4,gpt-3.5-turbo,gpt-3.5-turbo-16k # aliases, e.g. gpt4,gpt3
      enable: true # false / true; controls whether this model worker starts
      model_name_or_path: /home/dev/model/qwen/Qwen1___5-14B-Chat/
      model_type: qwen # qwen / yi / internlm
      work_mode: vllm # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      device: gpu # gpu / cpu
      workers:
        - gpus:
            - 1
      # - gpus:
      #     - 3

  # Embedding model
  - bge-base-zh:
      alias: null # aliases
      enable: true # false / true
      model_name_or_path: /home/dev/model/Xorbits/bge-base-zh-v1___5/
      model_type: embedding_infinity # embedding_infinity
      work_mode: hf
      device: gpu # gpu / cpu
      workers:
        - gpus:
            - 2

  # Reranker model
  - bge-reranker-base:
      alias: null # aliases
      enable: true # false / true; controls whether this model worker starts
      model_name_or_path: /home/dev/model/Xorbits/bge-reranker-base/
      model_type: embedding_infinity # embedding_infinity
      work_mode: hf
      device: gpu # gpu / cpu
      workers:
        - gpus:
            - 2
```
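To make the `workers` / `gpus` semantics above concrete, here is a sketch using the same schema as the sample config (the GPU indices are illustrative):

```yaml
# One model instance sharded across GPUs 0 and 1 (TP, tensor parallelism, the default)
workers:
  - gpus:
      - 0
      - 1

# Two replicas of the model: replica 1 on GPU 0, replica 2 on GPU 1
workers:
  - gpus:
      - 0
  - gpus:
      - 1
```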
```bash
cd gpt_server/script
sh start.sh
```
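Once the services are up, the server speaks the standard OpenAI API. A minimal smoke test, assuming the default port 8082 and the model name `qwen` from the sample config above:

```python
# Hedged sketch: a basic chat completion against the local gpt_server instance.
# Port 8082 and model name "qwen" are taken from the sample config above.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```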
Inference speed: LMDeploy TurboMind > vLLM > LMDeploy PyTorch > HF
Models / BackEnd | model_type | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch |
---|---|---|---|---|---|
chatglm4-9b | chatglm | √ | √ | √ | √ |
chatglm3-6b | chatglm | √ | √ | × | √ |
Qwen (7B, 14B, etc.) | qwen | √ | √ | √ | √ |
Qwen-1.5 (0.5B--72B) | qwen | √ | √ | √ | √ |
Qwen-2 | qwen | √ | √ | √ | √ |
Qwen-2.5 | qwen | √ | √ | √ | √ |
Yi-34B | yi | √ | √ | √ | √ |
Internlm-1.0 | internlm | √ | √ | √ | √ |
Internlm-2.0 | internlm | √ | √ | √ | √ |
Deepseek | deepseek | √ | √ | √ | √ |
Llama-3 | llama | √ | √ | √ | √ |
Baichuan-2 | baichuan | √ | √ | √ | √ |
VLM (vision-language model leaderboard: https://rank.opencompass.org.cn/leaderboard-multimodal)
Models / BackEnd | model_type | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch |
---|---|---|---|---|---|
glm-4v-9b | chatglm | × | × | × | √ |
InternVL2 | internvl2 | × | × | √ | √ |
MiniCPM-V-2_6 | minicpmv | × | × | √ | × |
Qwen2-VL | qwen | × | √ | × | × |
In principle, all Embedding/Rerank models are supported.
Inference speed: Infinity >> HF
The following models have been tested and can be used with confidence:
Embedding/Rerank | HF | Infinity |
---|---|---|
bge-reranker | √ | √ |
bce-reranker | √ | √ |
bge-embedding | √ | √ |
bce-embedding | √ | √ |
puff | √ | √ |
piccolo-base-zh-embedding | √ | √ |
acge_text_embedding | √ | √ |
Yinka | √ | √ |
zpoint_large_embedding_zh | √ | √ |
xiaobu-embedding | √ | √ |
Conan-embedding-v1 | √ | √ |
TencentBAC's Conan-embedding-v1 currently ranks first on the C-MTEB leaderboard (MTEB: https://huggingface.co/spaces/mteb/leaderboard)
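Embedding models served this way are reachable through the standard OpenAI embeddings API. A hedged sketch, assuming the `bge-base-zh` model name and port 8082 from the sample config above:

```python
# Hedged sketch: requesting embeddings from gpt_server's OpenAI-compatible API.
# Model name "bge-base-zh" and port 8082 come from the sample config above.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")

emb = client.embeddings.create(
    model="bge-base-zh",
    input=["An example sentence", "另一个示例句子"],
)
print(len(emb.data), len(emb.data[0].embedding))  # vector count and dimensionality
```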
See the gpt_server/tests directory for sample test code: https://github.com/shell-nlp/gpt_server/tree/main/tests
```bash
cd gpt_server/tests
python web_demo.py
```
WebUI:
```bash
docker pull 506610466/gpt_server:latest # if this pull fails, try the mirror below
# If Docker images cannot be pulled from inside China, try the domestic mirror below (its availability is not guaranteed)
docker pull docker.rainbond.cc/506610466/gpt_server:latest
```

```bash
docker build --rm -f "Dockerfile" -t gpt_server:latest "."
docker-compose -f "docker-compose.yml" up -d --build gpt_server
```
FastChat : https://github.com/lm-sys/FastChat
vLLM : https://github.com/vllm-project/vllm
LMDeploy : https://github.com/InternLM/lmdeploy