xllm-service

A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments.

Stars: 86

xLLM-service is a service-layer framework developed based on the xLLM inference engine, providing efficient, fault-tolerant, and flexible LLM inference services for clustered deployment. It addresses challenges in enterprise-level service scenarios such as ensuring the SLA of online services, improving resource utilization, reacting to changing request loads, resolving performance bottlenecks, and ensuring high reliability of computing instances. With features like unified scheduling, adaptive dynamic allocation, EPD three-stage disaggregation, and a fault-tolerant architecture, xLLM-service offers efficient and reliable LLM inference services.

README:

English | 中文

xLLM

1. Project Overview

xLLM-service is a service-layer framework developed based on the xLLM inference engine, providing efficient, fault-tolerant, and flexible LLM inference services for clustered deployment.

xLLM-service aims to address key challenges in enterprise-level service scenarios:

  • How to ensure the SLA of online services and improve resource utilization of offline tasks in a hybrid online-offline deployment environment.

  • How to react to changing request loads in actual businesses, such as fluctuations in input/output lengths.

  • How to resolve performance bottlenecks of multimodal model requests.

  • How to ensure high reliability of computing instances.


2. Key Features

With management of computing resource pools, intelligent scheduling and preemption of hybrid requests, and real-time monitoring of computing instances, xLLM-service provides the following key features:

  • Unified scheduling of online and offline requests, with preemptive execution for online requests and best-effort execution for offline requests (a conceptual sketch follows this list).

  • Adaptive dynamic allocation of PD ratios, supporting efficient switching of instance PD roles.

  • EPD three-stage disaggregation for multimodal requests, with intelligent resource allocation for different stages.

  • Fault-tolerant architecture, with fast detection of instance errors and automatic rescheduling of interrupted requests.
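
The sketch below illustrates the first feature above: a scheduler that always dispatches online requests ahead of offline ones and preempts a running offline request when online work arrives. It is a minimal, hypothetical C++ example written for this document; the class and function names are illustrative and do not come from the xllm-service sources.

// Conceptual sketch only: online-over-offline scheduling with preemption.
// All names here are hypothetical and not taken from the xllm-service code.
#include <deque>
#include <iostream>
#include <optional>

enum class Priority { kOnline, kOffline };

struct Request {
  int id;
  Priority priority;
};

class HybridScheduler {
 public:
  void Submit(const Request& r) {
    (r.priority == Priority::kOnline ? online_ : offline_).push_back(r);
  }

  // Online requests always go first; offline requests run best-effort when no
  // online work is pending.
  std::optional<Request> Next() {
    if (!online_.empty()) { Request r = online_.front(); online_.pop_front(); return r; }
    if (!offline_.empty()) { Request r = offline_.front(); offline_.pop_front(); return r; }
    return std::nullopt;
  }

  // Preemption: the interrupted offline request goes back to the head of its
  // queue and resumes once online traffic drains.
  void Preempt(const Request& running_offline) { offline_.push_front(running_offline); }

  bool HasOnlineWork() const { return !online_.empty(); }

 private:
  std::deque<Request> online_;
  std::deque<Request> offline_;
};

int main() {
  HybridScheduler sched;
  sched.Submit({1, Priority::kOffline});
  auto running = sched.Next();           // offline request 1 starts (nothing online yet)

  sched.Submit({2, Priority::kOnline});  // an online request arrives
  if (running && running->priority == Priority::kOffline && sched.HasOnlineWork()) {
    sched.Preempt(*running);             // offline request 1 is requeued
    running = sched.Next();              // online request 2 is served immediately
  }
  std::cout << "serving request " << running->id << "\n";

  // Offline request 1 resumes best-effort afterwards.
  while (auto r = sched.Next()) std::cout << "serving request " << r->id << "\n";
  return 0;
}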


3. Core Architecture

├── xllm-service/                    # main source folder
│   ├── chat_template/
│   ├── common/
│   ├── examples/
│   ├── http_service/
│   ├── rpc_service/
│   ├── tokenizers/
│   └── master.cpp
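
For multimodal traffic, the EPD three-stage disaggregation described in Section 2 splits a request into encode, prefill, and decode phases so that each phase can run on its own pool of instances and be sized independently. The sketch below only illustrates that data flow in-process; the types and functions are hypothetical and are not part of the xllm-service code base.

// Conceptual sketch only: EPD (Encode-Prefill-Decode) data flow for a
// multimodal request. All names are hypothetical.
#include <iostream>
#include <string>
#include <vector>

struct MultimodalRequest {
  std::string prompt;
  std::vector<std::string> images;  // references to image inputs
};

// In a disaggregated deployment, each stage would run on its own instance
// pool (e.g. more encode instances for vision-heavy traffic).
std::string EncodeStage(const MultimodalRequest& req) {
  // Vision encoder turns images into embeddings.
  return "embeddings(" + std::to_string(req.images.size()) + " images)";
}

std::string PrefillStage(const MultimodalRequest& req, const std::string& embeddings) {
  // Prefill builds the KV cache from prompt tokens plus image embeddings.
  return "kv_cache(" + req.prompt + ", " + embeddings + ")";
}

std::string DecodeStage(const std::string& kv_cache) {
  // Decode streams output tokens using the transferred KV cache.
  return "tokens generated from " + kv_cache;
}

int main() {
  MultimodalRequest req{"describe the picture", {"img_0.png"}};
  // The service layer routes each stage to a different instance pool; here the
  // three calls simply run sequentially to show the hand-off between stages.
  auto emb = EncodeStage(req);
  auto kv = PrefillStage(req, emb);
  std::cout << DecodeStage(kv) << "\n";
  return 0;
}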

4. Quick Start

Installation

git clone git@github.com:xllm-ai/xllm_service.git
cd xllm_service
git submodule init
git submodule update

Compilation

Compile xllm-service:

sh prepare.sh # apply patch
mkdir -p build && cd build
cmake .. && make -j 8

5. Contributing

There are several ways you can contribute to xLLM:

  1. Reporting Issues (Bugs & Errors)
  2. Suggesting Enhancements
  3. Improving Documentation
    • Fork the repository
    • Add your changes to the documentation
    • Send your pull request
  4. Writing Code
    • Fork the repository
    • Create a new branch
    • Add your feature or improvement
    • Send your pull request

We appreciate all kinds of contributions! 🎉🎉🎉 If you run into problems during development, please check our documentation: Document


6. Community & Support

If you encounter any issues along the way, you are welcome to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack.

Welcome to contact us:

(contact QR codes)

7. About the Contributors

Thanks to all the following developers who have contributed to xLLM.


8. License

Apache License

xLLM is provided by JD.com

Thanks for your contributions!
