sarathi-serve

sarathi-serve

A low-latency & high-throughput serving engine for LLMs

Stars: 112

Visit
 screenshot

Sarathi-Serve is the official OSDI'24 artifact submission for paper #444, focusing on 'Taming Throughput-Latency Tradeoff in LLM Inference'. It is a research prototype built on top of CUDA 12.1, designed to optimize throughput-latency tradeoff in Large Language Models (LLM) inference. The tool provides a Python environment for users to install and reproduce results from the associated experiments. Users can refer to specific folders for individual figures and are encouraged to cite the paper if they use the tool in their work.

README:

Sarathi-Serve

This is the official OSDI'24 artifact submission for paper #444, "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve”.

Setup

Setup CUDA

Sarathi-Serve has been tested with CUDA 12.1 on A100 and A40 GPUs.

Clone repository

git clone [email protected]:microsoft/sarathi-serve.git

Create mamba environment

Setup mamba if you don't already have it,

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh # follow the instructions from there

Create a Python 3.10 environment,

mamba create -p ./env python=3.10  

Install Sarathi-Serve

pip install -e . --extra-index-url https://flashinfer.ai/whl/cu121/torch2.3/

Reproducing Results

Refer to readmes in individual folders corresponding to each figure in osdi-experiments.

Citation

If you use our work, please consider citing our paper:

@article{agrawal2024taming,
  title={Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve},
  author={Agrawal, Amey and Kedia, Nitin and Panwar, Ashish and Mohan, Jayashree and Kwatra, Nipun and Gulavani, Bhargav S and Tumanov, Alexey and Ramjee, Ramachandran},
  journal={Proceedings of 18th USENIX Symposium on Operating Systems Design and Implementation, 2024, Santa Clara},
  year={2024}
}

Acknowledgment

This repository originally started as a fork of the vLLM project. Sarathi-Serve is a research prototype and does not have complete feature parity with open-source vLLM. We have only retained the most critical features and adopted the codebase for faster research iterations.

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for sarathi-serve

Similar Open Source Tools

For similar tasks

For similar jobs