llm-d-inference-sim

A lightweight vLLM simulator for mocking out replicas.

`llm-d-inference-sim` is a lightweight, configurable, real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running heavy models. It operates as an OpenAI-compliant server, allowing developers to test clients, schedulers, and infrastructure using realistic request-response cycles, token streaming, and latency patterns. The simulator offers multiple modes of operation, response generation from predefined text or real datasets, latency simulation, configurable tokenization, LoRA management, KV cache simulation, failure injection, and deployment options for standalone or Kubernetes testing. It supports a subset of standard vLLM Prometheus metrics for observability.

README:


vLLM Simulator

llm-d-inference-sim is a lightweight, configurable, and real-time simulator designed to mimic the behavior of vLLM without the need for GPUs or running actual heavy models. It operates as a fully OpenAI-compliant server, allowing developers to test clients, schedulers, and infrastructure using realistic request-response cycles, token streaming, and latency patterns.

Why is this required?

Running full LLM inference requires significant GPU resources and introduces non-deterministic latency, making it difficult to isolate infrastructure bugs or iterate quickly on control-plane logic. This simulator decouples development from heavy inference, offering a controlled environment to:

  • Accelerate Infrastructure Development: Test routing, scheduling, and KV cache locality logic without waiting for slow, expensive GPU operations.
  • Ensure Deterministic Testing: Simulate precise token timing and latency to isolate performance regressions and bugs in a way that is impossible with non-deterministic real models.
  • Validate Observability: Mirror vLLM’s Prometheus metrics to ensure monitoring and alerting systems are functioning correctly before deploying to production.
  • Test Advanced Features: Safely develop complex logic such as LoRA adapter lifecycles (loading, unloading, and switching) and Disaggregated Prefill integrations.

How it Works

The simulator is designed to act as a drop-in replacement for vLLM, sitting between your client/infrastructure and the void where the GPU usually resides. It processes requests through a configurable simulation engine that governs what is returned and when it is returned.

For detailed configuration definitions, see the Configuration Guide.

Modes of Operation

The simulator decides the content of the response based on two primary modes:

  • Echo Mode (--mode echo):
    Acts as a loopback. The response content mirrors the input (e.g., the last user message in a chat request). Useful for network throughput testing where content validity is irrelevant.
  • Random Mode (--mode random):
    The default mode. Generates synthetic responses based on requested parameters (like max_tokens). Utilizes probabilistic histograms to determine response length. Content is sourced from either a set of pre-defined sentences or a custom dataset (see below).
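
As a quick illustration, echo mode can be exercised end to end with a sketch like the one below (assuming the simulator binary has been built as described under Standalone testing later in this README):

./bin/llm-d-inference-sim --model my_model --port 8000 --mode echo

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my_model", "messages": [{"role": "user", "content": "ping"}]}'

The response content mirrors the last user message ("ping"), which makes it easy to verify request plumbing without caring about content validity.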

Dual Protocol Support

Natively supports both HTTP (OpenAI-compatible) and gRPC (vLLM-compatible) interfaces on the same port, allowing for versatile integration testing across different client architectures.

For detailed API definitions see the APIs Guide.

Response Generation & Datasets

In Random Mode, the simulator can generate content in two ways:

  • Predefined Text: By default, it constructs responses by concatenating random sentences from a built-in list until the target token length is met.

  • Real Datasets: If a dataset is provided (via --dataset-path or --dataset-url), the simulator attempts to match the hash of the incoming prompt to a conversation history in the database. If a match is found, it returns the stored response. If no match is found, it falls back to a random response from the dataset or predefined text.
    Supports downloading SQLite datasets directly from HuggingFace.
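
For example, a run that sources responses from a dataset could look like the sketch below (the URL is a placeholder; point --dataset-url at a real SQLite dataset, such as one hosted on HuggingFace, or use --dataset-path for a local file):

./bin/llm-d-inference-sim --model my_model --port 8000 --mode random \
  --dataset-url https://huggingface.co/datasets/<org>/<dataset>/resolve/main/conversations.sqlite3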

For details on the response generation algorithms, see the Response Generation Guide.

Latency Simulation

Unlike simple mock servers that just "sleep" for a fixed time, this simulator models the physics of LLM inference:

  • Time to first token: Simulates the prefill phase latency, including configurable standard deviation (jitter) for realism.

  • Inter-token latency: Simulates the decode phase, adding delays between every subsequent token generation.

  • Load Simulation: The simulator automatically increases latency as the number of concurrent requests grows.

  • Disaggregated Prefill (PD): Can simulate KV-cache transfer latency instead of standard TTFT when mimicking Prefill/Decode disaggregation architectures.
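
A typical invocation exercising these knobs might look like the sketch below (the flag names and millisecond units are assumptions based on the descriptions above; confirm the exact options in the Configuration Guide):

./bin/llm-d-inference-sim --model my_model --port 8000 \
  --time-to-first-token 2000 \
  --inter-token-latency 100 \
  --kv-cache-transfer-latency 500

Here time-to-first-token models the prefill phase, inter-token-latency adds a delay between generated tokens, and kv-cache-transfer-latency would apply when mimicking Prefill/Decode disaggregation.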

Tokenization

The simulator offers flexible tokenization to balance accuracy against performance, automatically selecting between two tokenization modes based on the provided --model name:

  • HuggingFace Mode: Used for real models (e.g., meta-llama/Llama-3.1-8B-Instruct). Downloads actual tokenizers for exact accuracy.
  • Simulated Mode: Used for dummy/non-existent model names. Uses a fast regex tokenizer for maximum performance with zero startup overhead.

For details on caching, environment variables (HF_TOKEN), and performance tuning, see the Tokenization Guide.
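
For example (a sketch; HF_TOKEN may be required for gated models such as the Llama example, and my_model stands for any dummy name):

# HuggingFace mode: a real model name downloads the actual tokenizer
HF_TOKEN=<your-hf-token> ./bin/llm-d-inference-sim --model meta-llama/Llama-3.1-8B-Instruct --port 8000

# Simulated mode: a dummy model name uses the fast regex tokenizer
./bin/llm-d-inference-sim --model my_model --port 8000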

LoRA Management

Simulates the lifecycle (loading/unloading) of LoRA adapters without occupying actual memory, and reports LoRA-related Prometheus metrics.
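
If the simulator mirrors vLLM's dynamic LoRA endpoints, loading and unloading an adapter at runtime could look like the sketch below (the endpoint paths and payload fields follow vLLM's convention and are assumptions here; the adapter name and path are hypothetical; adapters can also be preloaded at startup with --lora-modules, as shown in the Docker example later in this README):

curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2", "lora_path": "/adapters/tweet-summary-2"}'

curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "tweet-summary-2"}'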

KV Cache Simulation

Tracks simulated memory usage and publishes ZMQ events for cache block allocation and eviction.

Failure Injection

Can randomly inject specific errors (e.g., rate_limit, model_not_found) to test client resilience.
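
A possible command line for this might look like the sketch below (the flag names and values are assumptions; check the Configuration Guide for the actual options):

./bin/llm-d-inference-sim --model my_model --port 8000 \
  --failure-injection-rate 10 \
  --failure-types rate_limit,model_not_found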

Deployment Options

The simulator is designed to run either as a standalone binary or within a Kubernetes Pod (e.g., for testing with Kind).

Observability

The simulator supports a subset of standard vLLM Prometheus metrics.

For details, see the Metrics Guide.
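
Assuming the metrics are exposed on the standard Prometheus path, as in vLLM, they can be inspected locally with a plain HTTP request:

curl http://localhost:8000/metrics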

Working with the Docker image

Building

To build a Docker image of the vLLM Simulator, run:

make image-build

Please note that the default image tag is ghcr.io/llm-d/llm-d-inference-sim:dev.

The following environment variables can be used to change the image tag:

Variable         Description                    Default Value
IMAGE_REGISTRY   Name of the image registry     ghcr.io/llm-d
IMAGE_TAG_BASE   Image base name                $(IMAGE_REGISTRY)/llm-d-inference-sim
SIM_TAG          Image tag                      dev
IMG              The full image specification   $(IMAGE_TAG_BASE):$(SIM_TAG)
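
For example, to build the image with a different registry and tag (the registry and tag values below are purely illustrative), the variables can be overridden on the make command line:

make image-build IMAGE_REGISTRY=quay.io/my-org SIM_TAG=v1.2.3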

Running

To run the vLLM Simulator image under Docker, run:

docker run --rm --publish 8000:8000 ghcr.io/llm-d/llm-d-inference-sim:dev \
  --port 8000 --model "Qwen/Qwen2.5-1.5B-Instruct" \
  --lora-modules '{"name":"tweet-summary-0"}' '{"name":"tweet-summary-1"}'

Note: To run the vLLM Simulator with the latest release version, replace dev in the above docker command with the current release tag, which can be found on GitHub.

Note: The above command exposes the simulator on port 8000, and serves the Qwen/Qwen2.5-1.5B-Instruct model.

Standalone testing

Building

To build the vLLM simulator to run locally as an executable, run:

make build

Running

To run the vLLM simulator in a standalone test environment:

  1. Set the PYTHONPATH environment variable (needed for the tokenization code) by running:
. env-setup.sh
  2. Start the simulator:
./bin/llm-d-inference-sim --model my_model --port 8000
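
Once the simulator is up, a quick sanity check is to list the served models through the standard OpenAI endpoint (the model name returned should match the --model value used above):

curl http://localhost:8000/v1/models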

Kubernetes testing

To run the vLLM simulator in a Kubernetes cluster, run:

kubectl apply -f manifests/deployment.yaml

When testing locally with kind, build the Docker image with make image-build, then load it into the cluster:

kind load --name kind docker-image ghcr.io/llm-d/llm-d-inference-sim:dev

Update the deployment.yaml file to use the dev tag.

To verify the deployment is available, run:

kubectl get deployment vllm-llama3-8b-instruct
kubectl get service vllm-llama3-8b-instruct-svc

Use kubectl port-forward to expose the service on your local machine:

kubectl port-forward svc/vllm-llama3-8b-instruct-svc 8000:8000

Test the API with curl:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
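
Since the simulator also simulates token streaming, the same request can be issued in streaming mode by adding the standard OpenAI stream flag:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'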

Prefill/Decode (P/D) Separation Example

An example configuration for P/D (Prefill/Decode) disaggregation deployment can be found in manifests/disaggregation.
