litserve
Deploy AI models at scale. A high-throughput serving engine for AI/ML models built on state-of-the-art model deployment techniques.
Stars: 53
LitServe is a high-throughput serving engine for deploying AI models at scale. It generates an API endpoint for a model and handles batching, streaming, autoscaling across CPUs/GPUs, and more. Built for enterprise scale, it supports any framework, including PyTorch, JAX, TensorFlow, and more. LitServe is designed to let users focus on model performance, not the serving boilerplate. It is like PyTorch Lightning for model serving, but with broader framework support and scalability.
README:
High-throughput serving engine for AI models
✅ Batching ✅ Streaming ✅ Auto-GPU, multi-GPU ✅ PyTorch/JAX/TF ✅ Full control ✅ Auth
LitServe is a high-throughput serving engine for deploying AI models at scale. LitServe generates an API endpoint for a model, handles batching, streaming, autoscaling across CPU/GPUs and more.
Why we wrote LitServe:
- Work with any model: LLMs, vision, time-series, etc...
- We wanted a zero abstraction, minimal, hackable code-base without bloat.
- Built for enterprise scale (not demos, etc...).
- Easy enough for researchers, scalable and hackable for engineers.
- Work on any hardware (GPU/TPU) automatically.
- Let you focus on model performance, not the serving boilerplate.
Think of LitServe as PyTorch Lightning for model serving (if you're familiar with Lightning), but it supports any framework, including PyTorch, JAX, TensorFlow, and more.
Explore various examples that show different models deployed with LitServe:
| Example | Description | Run |
|---|---|---|
| Hello world | Hello world model | |
| Any Hugging Face model | (Text) Deploy any Hugging Face model | |
| Hugging Face BERT model | (Text) Deploy a BERT model for tasks like text generation and more | |
| OpenAI CLIP | (Multimodal) Deploy OpenAI CLIP for tasks like image understanding | |
| OpenAI Whisper | (Audio) Deploy OpenAI Whisper for tasks like speech to text | |
| Stable Diffusion 2 | (Vision) Deploy Stable Diffusion 2 for tasks like image generation | |
Install LitServe via pip:
pip install litserve

Advanced install options
Install the main branch:
pip install git+https://github.com/Lightning-AI/litserve.git@main
Install from source:
git clone https://github.com/Lightning-AI/litserve
cd litserve
pip install -e '.[all]'
LitServe is an inference server for AI/ML models that is minimal and highly scalable.
It has 2 simple, minimal APIs - LitAPI and LitServer.
Here's a hello world example:
# server.py
import litserve as ls
# STEP 1: DEFINE YOUR MODEL API
class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Setup the model so it can be called in `predict`.
        self.model = lambda x: x**2

    def decode_request(self, request):
        # Convert the request payload to your model input.
        return request["input"]

    def predict(self, x):
        # Run the model on the input and return the output.
        return self.model(x)

    def encode_response(self, output):
        # Convert the model output to a response payload.
        return {"output": output}

# STEP 2: START THE SERVER
if __name__ == "__main__":
    api = SimpleLitAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)

Now run the server via the command line:
python server.py

LitServe automatically generates a client when it starts. Use this client to test the server:
python client.py

Or ping the server yourself directly:
import requests
response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})

The server expects the client to send a POST request to the /predict URL with a JSON payload.
The way the payload is structured is up to the implementation of the LitAPI subclass.
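For example, with the SimpleLitAPI defined above running locally, the squared input comes back in the "output" field. A minimal check, assuming the server from server.py is running on port 8000:

```python
# A quick check of the hello world server; assumes server.py above is running on port 8000.
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
response.raise_for_status()
print(response.json())  # expected: {"output": 16.0}
```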
LitServe supports multiple advanced state-of-the-art features.
| Feature | Description |
|---|---|
| Accelerators | CPU, GPU, Multi-GPU, mps |
| Auto-GPU | Detects and auto-runs on all GPUs on a machine |
| Model types | LLMs, Vision, Time series, any model type... |
| ML frameworks | PyTorch, JAX, TensorFlow, NumPy, etc... |
| Batching | ✅ |
| API authentication | ✅ |
| Full request/response control | ✅ |
| Automatic schema validation | ✅ |
| Handle timeouts | ✅ |
| Handle disconnects | ✅ |
| Streaming | ✅ |
[!NOTE] Our goal is not to jump on every hype train, but instead support features that scale under the most demanding enterprise deployments.
Explore each feature in detail:
Use accelerators automatically (GPUs, CPU, mps)
LitServe automatically detects GPUs on a machine and uses them when available:
import litserve as ls
# Automatically selects the available accelerator
api = SimpleLitAPI() # defined by you with ls.LitAPI
# when running on GPUs these are equivalent. It's best to let Lightning decide by not specifying it!
server = ls.LitServer(api)
server = ls.LitServer(api, accelerator="cuda")
server = ls.LitServer(api, accelerator="auto")

LitServer accepts an accelerator argument, which defaults to "auto". It can also be explicitly set to "cpu", "cuda", or "mps" if you wish to manually control device placement.
The following example shows how to set the accelerator manually:
import litserve as ls
# Run on CUDA-supported GPUs
server = ls.LitServer(SimpleLitAPI(), accelerator="cuda")
# Run on Apple's Metal-powered GPUs
server = ls.LitServer(SimpleLitAPI(), accelerator="mps")

Serve on multi-GPUs
LitServer has the ability to coordinate serving from multiple GPUs.
LitServer accepts a devices argument, which defaults to "auto". On multi-GPU machines, LitServe will run a copy of the model on each detected device.
The devices argument can also be set explicitly to the desired number of devices to use.
import litserve as ls
# Automatically selects the available accelerators
api = SimpleLitAPI() # defined by you with ls.LitAPI
# when running on a 4-GPUs machine these are equivalent.
# It's best to let Lightning decide by not specifying accelerator and devices!
server = ls.LitServer(api)
server = ls.LitServer(api, accelerator="cuda", devices=4)
server = ls.LitServer(api, accelerator="auto", devices="auto")

For example, running the API server on a 4-GPU machine, with a PyTorch model served on each GPU:
import torch, torch.nn as nn
import litserve as ls

class Linear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.linear.weight.data.fill_(2.0)
        self.linear.bias.data.fill_(1.0)

    def forward(self, x):
        return self.linear(x)

class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # move the model to the correct device
        # keep track of the device for moving data accordingly
        self.model = Linear().to(device)
        self.device = device

    def decode_request(self, request):
        # get the input and create a 1D tensor on the correct device
        content = request["input"]
        return torch.tensor([content], device=self.device)

    def predict(self, x):
        # the model expects a batch dimension, so create it
        return self.model(x[None, :])

    def encode_response(self, output):
        # float will take the output value directly onto CPU memory
        return {"output": float(output)}

if __name__ == "__main__":
    # accelerator="auto" (or "cuda"), devices="auto" (or 4) will lead to 4 workers serving
    # the model from "cuda:0", "cuda:1", "cuda:2", "cuda:3" respectively
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto", devices="auto")
    server.run(port=8000)

The devices argument can also be a list specifying which device IDs to run the model on:
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=[0, 3])

Finally, if the model is small, you can run multiple copies of it on the same device. The following will load two copies of the model on each of the 4 GPUs:
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=4, workers_per_device=2)

Timeouts and disconnections
The server will remove a queued request if the client requesting it disconnects.
You can configure a timeout (in seconds) after which clients will receive a 504 HTTP
response (Gateway Timeout) indicating that their request has timed out.
For example, this is how you can configure the server with a timeout of 30 seconds per response.
server = LitServer(SimpleLitAPI(), accelerator="cuda", devices=4, timeout=30)

This is useful to avoid requests queuing up beyond the ability of the server to respond.
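On the client side, a timed-out request surfaces as an HTTP 504. A minimal sketch of handling it, reusing the hello world endpoint from above (the retry and backoff policy here is only an illustration, not part of LitServe):

```python
# Client-side handling of a 504 Gateway Timeout; the retry policy is illustrative only.
import time
import requests

def predict_with_retry(payload, retries=3):
    for attempt in range(retries):
        response = requests.post("http://127.0.0.1:8000/predict", json=payload)
        if response.status_code == 504:  # request timed out on the server
            time.sleep(2 ** attempt)     # back off before retrying
            continue
        response.raise_for_status()
        return response.json()
    raise TimeoutError("server kept timing out")

print(predict_with_retry({"input": 4.0}))
```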
To disable the timeout for long-running tasks, set timeout=False or timeout=-1:
server = LitServer(SimpleLitAPI(), timeout=False)

Use API key authentication
To secure the API behind an API key, define the env var when starting the server:
LIT_SERVER_API_KEY=supersecretkey python main.py

Clients are expected to authenticate with the same API key, set in the X-API-Key HTTP header.
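A client then passes the key in the X-API-Key header — a minimal sketch, assuming the hello world server above was started with LIT_SERVER_API_KEY=supersecretkey on port 8000:

```python
# Authenticated request; the key must match LIT_SERVER_API_KEY set on the server.
import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    headers={"X-API-Key": "supersecretkey"},
    json={"input": 4.0},
)
print(response.status_code, response.json())
```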
Dynamic batching
LitServe can combine individual requests into a batch to improve throughput.
To enable batching, you need to set the max_batch_size argument to match the batch size that your model can handle
and implement LitAPI.predict to process batched inputs.
import numpy as np
import litserve as ls

class SimpleBatchedAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x: x ** 2

    def decode_request(self, request):
        return np.asarray(request["input"])

    def predict(self, x):
        result = self.model(x)
        return result

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    api = SimpleBatchedAPI()
    server = ls.LitServer(api, max_batch_size=4, batch_timeout=0.05)
    server.run(port=8000)

You can control the wait time to aggregate requests into a batch with the batch_timeout argument.
In the above example, the server will wait up to 0.05 seconds to combine up to 4 requests into a single batch.
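To see batching in action, send several requests concurrently so they arrive within the batch window — a minimal sketch, assuming the batching server above is running on port 8000:

```python
# Concurrent requests give the server a chance to group them into a single batch.
from concurrent.futures import ThreadPoolExecutor
import requests

def query(value):
    return requests.post("http://127.0.0.1:8000/predict", json={"input": value}).json()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(query, [1.0, 2.0, 3.0, 4.0]))

print(results)  # each client still gets its own response, e.g. {"output": 1.0}, {"output": 4.0}, ...
```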
LitServe automatically stacks NumPy arrays and PyTorch tensors along the batch dimension before calling the
LitAPI.predict method, and splits the output across requests afterward. You can customize this behavior by overriding the
LitAPI.batch and LitAPI.unbatch methods to handle different data types.
class SimpleBatchedAPI(ls.LitAPI):
    ...

    def batch(self, inputs):
        return np.stack(inputs)

    def unbatch(self, output):
        return list(output)

    ...

Stream long responses
LitServe can stream outputs from the model in real-time, such as returning text one word at a time from a language model.
To enable streaming, you need to set LitServer(..., stream=True) and implement LitAPI.predict and LitAPI.encode_response as generators (Python functions that yield output).
For example, streaming long responses generated over time:
import json
import litserve as ls
class SimpleStreamAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x, y: x * y

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for i in range(10):
            yield self.model(x, i)

    def encode_response(self, output):
        for out in output:
            yield json.dumps({"output": out})

if __name__ == "__main__":
    api = SimpleStreamAPI()
    server = ls.LitServer(api, stream=True)
    server.run(port=8000)
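A client can then read the response incrementally as it is produced — a minimal sketch using requests with stream=True (the exact framing of the streamed chunks depends on how encode_response serializes them, so this simply prints whatever arrives):

```python
# Streaming client; assumes the streaming server above is running on port 8000.
import requests

with requests.post("http://127.0.0.1:8000/predict", json={"input": 3.0}, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk)  # each chunk corresponds to an output yielded by encode_response
```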
Automatic schema validation
Define the request and response as Pydantic models to validate requests automatically.
from pydantic import BaseModel
import litserve as ls
class PredictRequest(BaseModel):
    input: float

class PredictResponse(BaseModel):
    output: float

class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = lambda x: x**2

    def decode_request(self, request: PredictRequest) -> float:
        return request.input

    def predict(self, x):
        return self.model(x)

    def encode_response(self, output: float) -> PredictResponse:
        return PredictResponse(output=output)

if __name__ == "__main__":
    api = SimpleLitAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8888)

LitServe is a community project accepting contributions. Let's make the world's most advanced AI inference engine.
Run tests
Use pytest to run tests locally.
First, install test dependencies:
pip install -r _requirements/test.txt

Run the tests:
pytest tests

LitServe is released under the Apache 2.0 license. See the LICENSE file for details.
Similar Open Source Tools
react-native-fast-tflite
A high-performance TensorFlow Lite library for React Native that utilizes JSI for power, zero-copy ArrayBuffers for efficiency, and low-level C/C++ TensorFlow Lite core API for direct memory access. It supports swapping out TensorFlow Models at runtime and GPU-accelerated delegates like CoreML/Metal/OpenGL. Easy VisionCamera integration allows for seamless usage. Users can load TensorFlow Lite models, interpret input and output data, and utilize GPU Delegates for faster computation. The library is suitable for real-time object detection, image classification, and other machine learning tasks in React Native applications.
langserve
LangServe helps developers deploy `LangChain` runnables and chains as a REST API. This library is integrated with FastAPI and uses pydantic for data validation. In addition, it provides a client that can be used to call into runnables deployed on a server. A JavaScript client is available in LangChain.js.
python-tgpt
Python-tgpt is a Python package that enables seamless interaction with over 45 free LLM providers without requiring an API key. It also provides image generation capabilities. The name _python-tgpt_ draws inspiration from its parent project tgpt, which operates on Golang. Through this Python adaptation, users can effortlessly engage with a number of free LLMs available, fostering a smoother AI interaction experience.
r2ai
r2ai is a tool designed to run a language model locally without internet access. It can be used to entertain users or assist in answering questions related to radare2 or reverse engineering. The tool allows users to prompt the language model, index large codebases, slurp file contents, embed the output of an r2 command, define different system-level assistant roles, set environment variables, and more. It is accessible as an r2lang-python plugin and can be scripted from various languages. Users can use different models, adjust query templates dynamically, load multiple models, and make them communicate with each other.
nano-graphrag
nano-GraphRAG is a simple, easy-to-hack implementation of GraphRAG that provides a smaller, faster, and cleaner version of the official implementation. It is about 800 lines of code, small yet scalable, asynchronous, and fully typed. The tool supports incremental insert, async methods, and various parameters for customization. Users can replace storage components and LLM functions as needed. It also allows for embedding function replacement and comes with pre-defined prompts for entity extraction and community reports. However, some features like covariates and global search implementation differ from the original GraphRAG. Future versions aim to address issues related to data source ID, community description truncation, and add new components.
consult-llm-mcp
Consult LLM MCP is an MCP server that enables users to consult powerful AI models like GPT-5.2, Gemini 3.0 Pro, and DeepSeek Reasoner for complex problem-solving. It supports multi-turn conversations, direct queries with optional file context, git changes inclusion for code review, comprehensive logging with cost estimation, and various CLI modes for Gemini and Codex. The tool is designed to simplify the process of querying AI models for assistance in resolving coding issues and improving code quality.
mistral-inference
Mistral Inference repository contains minimal code to run 7B, 8x7B, and 8x22B models. It provides model download links, installation instructions, and usage guidelines for running models via CLI or Python. The repository also includes information on guardrailing, model platforms, deployment, and references. Users can interact with models through commands like mistral-demo, mistral-chat, and mistral-common. Mistral AI models support function calling and chat interactions for tasks like testing models, chatting with models, and using Codestral as a coding assistant. The repository offers detailed documentation and links to blogs for further information.
text-embeddings-inference
Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for popular models like FlagEmbedding, Ember, GTE, and E5. It implements features such as no model graph compilation step, Metal support for local execution on Macs, small docker images with fast boot times, token-based dynamic batching, optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt, Safetensors weight loading, and production-ready features like distributed tracing with Open Telemetry and Prometheus metrics.
syncode
SynCode is a novel framework for the grammar-guided generation of Large Language Models (LLMs) that ensures syntactically valid output based on a Context-Free Grammar (CFG). It supports various programming languages like Python, Go, SQL, Math, JSON, and more. Users can define custom grammars using EBNF syntax. SynCode offers fast generation, seamless integration with HuggingFace Language Models, and the ability to sample with different decoding strategies.
magentic
Easily integrate Large Language Models into your Python code. Simply use the `@prompt` and `@chatprompt` decorators to create functions that return structured output from the LLM. Mix LLM queries and function calling with regular Python code to create complex logic.
raglite
RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with PostgreSQL or SQLite. It offers configurable options for choosing LLM providers, database types, and rerankers. The toolkit is fast and permissive, utilizing lightweight dependencies and hardware acceleration. RAGLite provides features like PDF to Markdown conversion, multi-vector chunk embedding, optimal semantic chunking, hybrid search capabilities, adaptive retrieval, and improved output quality. It is extensible with a built-in Model Context Protocol server, customizable ChatGPT-like frontend, document conversion to Markdown, and evaluation tools. Users can configure RAGLite for various tasks like configuring, inserting documents, running RAG pipelines, computing query adapters, evaluating performance, running MCP servers, and serving frontends.
mLoRA
mLoRA (Multi-LoRA Fine-Tune) is an open-source framework for efficient fine-tuning of multiple Large Language Models (LLMs) using LoRA and its variants. It allows concurrent fine-tuning of multiple LoRA adapters with a shared base model, efficient pipeline parallelism algorithm, support for various LoRA variant algorithms, and reinforcement learning preference alignment algorithms. mLoRA helps save computational and memory resources when training multiple adapters simultaneously, achieving high performance on consumer hardware.
screen-pipe
Screen-pipe is a Rust + WASM tool that allows users to turn their screen into actions using Large Language Models (LLMs). It enables users to record their screen 24/7, extract text from frames, and process text and images for tasks like analyzing sales conversations. The tool is still experimental and aims to simplify the process of recording screens, extracting text, and integrating with various APIs for tasks such as filling CRM data based on screen activities. The project is open-source and welcomes contributions to enhance its functionalities and usability.
agenticSeek
AgenticSeek is a voice-enabled AI assistant powered by DeepSeek R1 agents, offering a fully local alternative to cloud-based AI services. It allows users to interact with their filesystem, code in multiple languages, and perform various tasks autonomously. The tool is equipped with memory to remember user preferences and past conversations, and it can divide tasks among multiple agents for efficient execution. AgenticSeek prioritizes privacy by running entirely on the user's hardware without sending data to the cloud.
For similar tasks
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.
zep-python
Zep is an open-source platform for building and deploying large language model (LLM) applications. It provides a suite of tools and services that make it easy to integrate LLMs into your applications, including chat history memory, embedding, vector search, and data enrichment. Zep is designed to be scalable, reliable, and easy to use, making it a great choice for developers who want to build LLM-powered applications quickly and easily.
AI-in-a-Box
AI-in-a-Box is a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction, while maintaining the highest standards of quality and efficiency. It provides essential guidance on the responsible use of AI and LLM technologies, specific security guidance for Generative AI (GenAI) applications, and best practices for scaling OpenAI applications within Azure. The available accelerators include: Azure ML Operationalization in-a-box, Edge AI in-a-box, Doc Intelligence in-a-box, Image and Video Analysis in-a-box, Cognitive Services Landing Zone in-a-box, Semantic Kernel Bot in-a-box, NLP to SQL in-a-box, Assistants API in-a-box, and Assistants API Bot in-a-box.
NeMo
NeMo Framework is a generative AI framework built for researchers and pytorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.
E2B
E2B Sandbox is a secure sandboxed cloud environment made for AI agents and AI apps. Sandboxes allow AI agents and apps to have long running cloud secure environments. In these environments, large language models can use the same tools as humans do. For example: * Cloud browsers * GitHub repositories and CLIs * Coding tools like linters, autocomplete, "go-to defintion" * Running LLM generated code * Audio & video editing The E2B sandbox can be connected to any LLM and any AI agent or app.
floneum
Floneum is a graph editor that makes it easy to develop your own AI workflows. It uses large language models (LLMs) to run AI models locally, without any external dependencies or even a GPU. This makes it easy to use LLMs with your own data, without worrying about privacy. Floneum also has a plugin system that allows you to improve the performance of LLMs and make them work better for your specific use case. Plugins can be used in any language that supports web assembly, and they can control the output of LLMs with a process similar to JSONformer or guidance.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.