MinivLLM

Based on Nano-vLLM, a simple replication of vLLM with self-contained paged attention and flash attention implementations.



README:

| English | 简体中文 |

miniVLLM

A custom implementation of the vLLM inference engine with attention mechanism benchmarks, based on Nano-vLLM but with self-contained paged attention and flash attention implementations.

Benchmarks are provided for flash attention during prefilling and for paged attention during decoding.

New to vLLM? Check out HowToApproachvLLM.md for a step-by-step implementation guide covering layers, models, paged attention, CUDA graphs, and scheduling.

Quickstart

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies
uv sync

# Run the main inference engine
uv run python main.py

# Run prefilling benchmark
uv run python benchmark_prefilling.py

# Run decoding benchmark
uv run python benchmark_decoding.py

To run in a multi-GPU setting, simply change world_size to n > 1 in the config in main.py.
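
As a rough illustration, the kind of change involved is sketched below. This assumes a dataclass-style config object, as in Nano-vLLM-like engines; the field names here are illustrative and the actual config in main.py may differ.

from dataclasses import dataclass

# Illustrative stand-in for the engine config; the real one lives in main.py
# and its exact fields may differ.
@dataclass
class Config:
    model: str = "qwen3-small"
    world_size: int = 1        # number of GPUs (tensor-parallel ranks)
    max_num_seqs: int = 64

config = Config()              # single GPU (default)
config = Config(world_size=2)  # multi-GPU: any n > 1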

What Each Script Does

uv run python main.py

This is the main inference engine demo

Demonstrates the complete LLM inference pipeline using a custom engine implementation:

  • Creates a small version of Qwen3 with random initialization
  • Creates 60 chat prompts (2 base prompts repeated 30 times each)
  • Processes them through the custom LLM engine with batch processing
  • Uses paged attention and KV cache management for efficient inference
  • Generates up to 256 tokens per prompt with temperature sampling

This showcases how the custom vLLM implementation handles batched text generation with memory-efficient attention.
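
As a rough sketch, the generation flow in main.py looks something like the following. This assumes a Nano-vLLM-style API (an engine class with a generate method, plus SamplingParams from sampling_parameters.py); the class names, import paths, and prompt text below are illustrative and may differ from the actual code.

from myvllm.engine import LLMEngine                    # assumed entry point (see src/myvllm/engine/)
from myvllm.sampling_parameters import SamplingParams  # assumed class name

# 2 base prompts repeated 30 times each -> 60 prompts handled as one batch
base_prompts = ["Explain paged attention.", "Write a short poem about GPUs."]
prompts = base_prompts * 30

engine = LLMEngine(model="qwen3-small-random", world_size=1)  # illustrative config
params = SamplingParams(temperature=0.8, max_tokens=256)      # temperature sampling, up to 256 tokens

# The engine batches the prompts, prefills them with flash attention, then decodes
# token by token against the paged KV cache.
outputs = engine.generate(prompts, params)
for text in outputs:
    print(text)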

uv run python benchmark_prefilling.py

This is the prefilling phase comparison

Compares three attention implementations during the prefilling phase (processing input prompts):

  1. PyTorch Standard (O(N²) memory): Traditional attention that materializes full attention matrix
  2. Naive Triton (O(N²) memory): GPU kernel that also uses O(N²) memory, limited by shared memory constraints (≤128 tokens)
  3. Flash Attention (O(N) memory): Memory-efficient online softmax algorithm that processes attention in blocks (a PyTorch sketch of the idea follows this list)
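
To make the memory difference concrete, here is a small, self-contained PyTorch sketch of block-wise attention with online softmax, the idea behind the flash attention kernel. It is didactic only (single head, no causal mask, plain PyTorch instead of Triton) and is not the repository's actual kernel.

import torch

def blockwise_attention(q, k, v, block_size=64):
    # q, k, v: (seq_len, head_dim) for a single head; causal masking omitted for brevity.
    N, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((N, 1), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros(N, 1, device=q.device, dtype=q.dtype)
    for start in range(0, N, block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                        # only one (N, block) tile of scores exists at a time
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        correction = torch.exp(row_max - new_max)          # rescale what has been accumulated so far
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

# Agrees with standard O(N^2) attention up to floating-point error:
q, k, v = (torch.randn(512, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-4)
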
uv run python benchmark_decoding.py

This is the decoding phase comparison

Compares three implementations during the decoding phase (generating output tokens one at a time):

  1. Naive PyTorch: Simple loop-based implementation using the paged KV cache (a minimal single-sequence sketch follows this list)
  2. Optimized PyTorch: Vectorized implementation with batch gathering and masking
  3. Triton Kernel: Custom GPU kernel optimized for paged attention decode
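
For intuition, the sketch below shows a single decode step over a paged KV cache in plain PyTorch, for one sequence and one head: look up the physical blocks listed in the block table, gather and trim them to the current context length, and attend with the newest query token. It follows the spirit of the PyTorch variants above, not the Triton kernel, and the tensor layouts are assumptions.

import torch

def paged_decode_step(q, k_cache, v_cache, block_table, context_len, block_size=16):
    # q: (head_dim,) query for the newest token
    # k_cache, v_cache: (num_physical_blocks, block_size, head_dim) paged caches
    # block_table: (max_blocks,) long tensor mapping logical block index -> physical block id
    d = q.shape[-1]
    num_logical = (context_len + block_size - 1) // block_size
    phys = block_table[:num_logical]                 # physical blocks actually in use
    k = k_cache[phys].reshape(-1, d)[:context_len]   # gather, flatten, trim the padded tail of the last block
    v = v_cache[phys].reshape(-1, d)[:context_len]
    scores = (k @ q) * d ** -0.5                     # (context_len,)
    probs = torch.softmax(scores, dim=-1)
    return probs @ v                                 # (head_dim,) attention output for the new token

# Toy usage: 40 cached tokens spread over 3 non-contiguous physical blocks of size 16.
d, block_size, num_blocks = 64, 16, 8
k_cache = torch.randn(num_blocks, block_size, d)
v_cache = torch.randn(num_blocks, block_size, d)
block_table = torch.tensor([5, 2, 7, 0, 0, 0, 0, 0])  # logical blocks 0..2 live in physical blocks 5, 2, 7
out = paged_decode_step(torch.randn(d), k_cache, v_cache, block_table, context_len=40, block_size=block_size)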

Project Structure

myvllm/
├── src/
│   └── myvllm/           # Core vLLM implementation
│       ├── models/       # Model implementations
│       ├── engine/       # Engine logic: sequence definition for input prompts, block manager for the GPU KV cache, iteration-based scheduler, model runner that executes prefilling and decoding, and the generation API (block-manager sketch below the tree)
│       ├── layers/       # Components used by the models
│       ├── utils/        # Context utilities
│       └── sampling_parameters.py
├── main.py              # Full inference demo
├── benchmark_prefilling.py   # Prefilling attention comparison
└── benchmark_decoding.py     # Decoding attention comparison
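
To illustrate the block management mentioned for engine/ above, here is a minimal, self-contained sketch of a fixed-size-block KV-cache allocator with a free list and per-sequence block tables. The class and method names are illustrative and are not the repository's actual API.

class BlockManager:
    # Illustrative KV-cache block manager: fixed-size blocks, a free list, per-sequence block tables.

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids not currently in use
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids, in logical order

    def _blocks_needed(self, num_tokens: int) -> int:
        return (num_tokens + self.block_size - 1) // self.block_size

    def can_allocate(self, num_tokens: int) -> bool:
        return self._blocks_needed(num_tokens) <= len(self.free_blocks)

    def allocate(self, seq_id: int, num_tokens: int) -> list[int]:
        # Prefill time: reserve enough blocks for the whole prompt.
        blocks = [self.free_blocks.pop() for _ in range(self._blocks_needed(num_tokens))]
        self.block_tables[seq_id] = blocks
        return blocks

    def append_slot(self, seq_id: int, num_tokens: int) -> None:
        # Decode time: grab one more block only when the last one fills up.
        table = self.block_tables[seq_id]
        if num_tokens > len(table) * self.block_size:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        # When a sequence finishes, its blocks return to the free list for reuse.
        self.free_blocks.extend(self.block_tables.pop(seq_id))

mgr = BlockManager(num_blocks=1024, block_size=16)
table = mgr.allocate(seq_id=0, num_tokens=40)   # 3 blocks cover a 40-token prompt
mgr.append_slot(seq_id=0, num_tokens=49)        # 49 > 3 * 16, so a 4th block is appended
mgr.free(seq_id=0)                              # all 4 blocks go back to the free list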

Requirements

  • Python ≥3.11, < 3.12
  • CUDA-capable GPU
  • Dependencies: transformers, torch, xxhash (managed by uv)

Star History

[Star History Chart]
