
Fast-dLLM

Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"

Project arXiv

Fast-dLLM is a training-free inference acceleration framework for diffusion-based Large Language Models (LLMs) that supports efficient inference for models such as Dream and LLaDA.

Demo

https://github.com/user-attachments/assets/32bbff97-6e60-4e14-95c0-2cbec136476f

End-to-end speedup over vanilla LLaDA baseline

News

  • [2025.08.01] Our new online demo of Fast-dLLM is live at https://fast-dllm.hanlab.ai/. Feel free to try it out!
  • [2025.07.06] Added factor-based parallel strategy and LLaDA-1.5 evaluation in llada/eval_gsm8k.sh.
  • [2025.07.04] We updated our paper with the latest improvements and evaluation results.
  • [2025.06.30] Fast-dLLM has been integrated into LLaDA-V. With Fast-dLLM, inference latency drops from 60s to 6s! Give it a try!

Project Structure

.
├── dream/          # Dream model related code
├── llada/          # LLaDA model related code
└── .gitignore      # Git ignore configuration

Features

  • Fast inference support for Dream and LLaDA models
  • Multiple inference optimization strategies
  • Code generation and evaluation capabilities
  • Interactive chat interface

Key Features

  1. Key-Value Cache for Block-Wise Decoding: We propose an efficient block-wise decoding KV Cache mechanism for Masked Diffusion Models (MDMs). By reusing attention key-value activations across the multiple steps within each block, our approach avoids redundant computation and significantly accelerates inference. Our DualCache extension additionally caches the masked suffix tokens, enabling even greater speedup with negligible accuracy loss.
KV Cache for block-wise decoding

  2. Confidence-Aware Parallel Decoding: Instead of decoding tokens one at a time, we introduce a confidence-aware parallel decoding scheme. At each step, every token whose confidence exceeds a threshold is unmasked in parallel, while uncertain tokens remain masked for later steps. This selective approach balances decoding efficiency against output quality; a simplified sketch combining it with the block-wise KV cache appears at the end of this section.
Left: Standard decoding (LLaDA). Right: Confidence-aware parallel decoding.

Pseudo code for our method

  3. Overall Performance: Introducing the KV Cache mechanism yields significant speed improvements across all tasks and sequence lengths, typically a 2x to 3.6x speedup over the vanilla backbone. When the parallel decoding strategy is applied on its own, it also provides acceleration, often pushing speedups to 4x-6x in the evaluated settings, particularly as the generation length increases. Combining both techniques yields the largest gains, up to 11x on the benchmarks reported below.
Overall performance comparison
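
To make the two mechanisms above concrete, here is a minimal, self-contained Python sketch of block-wise decoding with confidence-aware parallel unmasking. It is an illustration only, not the repository's implementation: toy_logits, MASK_ID, and all default values are stand-ins, and a real Masked Diffusion Model forward pass would additionally reuse the cached key/value activations of completed blocks (and, with DualCache, of the masked suffix).

# Illustrative sketch only; not the Fast-dLLM implementation.
import torch

VOCAB_SIZE = 32            # toy vocabulary
MASK_ID = VOCAB_SIZE       # hypothetical mask token id (outside the vocabulary)

def toy_logits(tokens):
    # Stand-in for the diffusion LLM forward pass over the masked positions.
    # A real model would reuse cached K/V for tokens outside the current block.
    return torch.randn(tokens.shape[0], VOCAB_SIZE)

def decode(prompt, gen_length=16, block_size=8, threshold=0.9, max_steps=32):
    x = torch.cat([prompt, torch.full((gen_length,), MASK_ID)])
    for block_start in range(len(prompt), len(x), block_size):
        block = slice(block_start, min(block_start + block_size, len(x)))
        # Block-wise KV cache idea: everything before block_start stays fixed for
        # the whole inner loop, so its K/V activations can be computed once and reused.
        for _ in range(max_steps):
            masked = (x[block] == MASK_ID).nonzero(as_tuple=True)[0] + block_start
            if len(masked) == 0:
                break
            probs = torch.softmax(toy_logits(x[masked]), dim=-1)
            conf, pred = probs.max(dim=-1)
            # Confidence-aware parallel decoding: unmask every position whose
            # confidence clears the threshold, but always accept at least the
            # single most confident token so each step makes progress.
            accept = conf >= threshold
            if not accept.any():
                accept[conf.argmax()] = True
            x[masked[accept]] = pred[accept]
    return x

print(decode(torch.tensor([5, 7, 9]), gen_length=8, block_size=4))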

Installation

  1. Clone the repository:
git clone https://github.com/your-username/fast-dllm.git
cd fast-dllm
  2. Install dependencies:
pip install -r requirements.txt

Usage

1. Using LLaDA Model

Interactive Chat

python llada/chat.py --gen_length 128 --steps 128 --block_size 32

Parameter descriptions:

  • --gen_length: Maximum length of generated text
  • --steps: Number of sampling steps
  • --block_size: Cache block size
  • --use_cache: Whether to use the KV cache
  • --if_cache_position: Whether to use the dual cache (also caches the masked suffix tokens)
  • --threshold: Confidence threshold
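
For example, to enable the KV cache together with confidence-aware parallel decoding, an invocation might look like the line below. The flag values are illustrative and the exact flag syntax may differ; check the script's argument parser for details.

python llada/chat.py --gen_length 128 --steps 128 --block_size 32 --use_cache --threshold 0.9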

Web Demo

We also provide a web demo using Gradio. First, install Gradio:

pip install gradio

Then run the demo:

cd llada
python app.py
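
For reference, a Gradio wrapper of this kind is typically only a few lines. The sketch below is generic and is not the repository's app.py; generate_response is a hypothetical stand-in for the Fast-dLLM generation call.

import gradio as gr

def generate_response(prompt):
    # Hypothetical stand-in: call the Fast-dLLM / LLaDA generation pipeline here.
    return "echo: " + prompt

gr.Interface(fn=generate_response, inputs="text", outputs="text").launch()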

Model Evaluation

Benchmark            Gen Length   LLaDA               +Cache                +Parallel              +Cache+Parallel (Fast-dLLM)
GSM8K (5-shot)       256          79.3 | 6.73 (1×)    79.5 | 21.23 (3.2×)   79.2 | 16.53 (2.5×)    78.5 | 54.4 (8.1×)
GSM8K (5-shot)       512          77.5 | 3.23 (1×)    77.0 | 10.43 (3.3×)   77.6 | 18.63 (5.8×)    77.2 | 35.3 (11.0×)
HumanEval (0-shot)   256          41.5 | 30.5 (1×)    42.7 | 40.73 (1.3×)   43.9 | 101.53 (3.3×)   43.3 | 114.1 (3.7×)
HumanEval (0-shot)   512          43.9 | 18.4 (1×)    45.7 | 29.33 (1.6×)   43.3 | 57.13 (3.1×)    44.5 | 73.7 (4.0×)

Each cell shows accuracy (%) followed by decoding throughput (tokens per second), with the relative speedup over the LLaDA baseline in parentheses. For example, on GSM8K with a generation length of 256, the full Fast-dLLM configuration decodes at 54.4 tokens/s versus 6.73 tokens/s for the baseline, an 8.1× speedup.

For detailed evaluation instructions for the GSM8K and HumanEval benchmarks, please refer to the LLaDA Evaluation Guide.

2. Using Dream Model

For detailed evaluation instructions for the GSM8K and HumanEval benchmarks, please refer to the Dream Evaluation Guide.

Contributing

Issues and Pull Requests are welcome!

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find this work useful, please cite our paper:

@misc{wu2025fastdllmtrainingfreeaccelerationdiffusion,
      title={Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding}, 
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Zhijian Liu and Shizhe Diao and Ligeng Zhu and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2505.22618},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22618}, 
}

Acknowledgements

We would like to thank the authors of LLaDA and Dream for their excellent work and open-source contributions.
