SpargeAttn

SpargeAttention: A training-free sparse attention that can accelerate any model inference.

SpargeAttn is the official implementation of an accurate, training-free sparse attention mechanism that accelerates inference for any model. It delivers a significant speedup while preserving output quality. The implementation builds on SageAttention and SageAttention2, offers different levels of optimization, and exposes simple APIs that users can install and call directly. SpargeAttn is particularly useful for tasks where the attention mechanism dominates inference cost in deep learning models.

README:

SpargeAttn

This repository provides the official implementation of SpargeAttn.

SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper: https://arxiv.org/abs/2502.18137
Jintao Zhang, Chendong Xiang, Haofeng Huang, Haocheng Xi, Jia Wei, Jun Zhu, Jianfei Chen

Figure: speed comparison.

Installation

Base environment

  • python>=3.9, torch>=2.3.0
  • CUDA:
    • >=12.8 for Blackwell
    • >=12.4 for fp8 support on Ada
    • >=12.3 for fp8 support on Hopper
    • >=12.0 for Ampere

Install Package

python setup.py install   # or pip install -e .
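
After installing, you can sanity-check the base environment against the requirements above. This is a minimal sketch using only PyTorch; the final import assumes the package installs a module named spas_sage_attn, which may differ in your build.

import torch

# Check the prerequisites listed under "Base environment".
print("torch:", torch.__version__)            # expect >= 2.3.0
print("CUDA runtime:", torch.version.cuda)    # expect the version required by your GPU architecture
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))

# If installation succeeded, this import should work.
# NOTE: the module name is an assumption; check the repository if it differs.
import spas_sage_attn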

Available API
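
The sparse attention kernels are intended as drop-in replacements for scaled dot-product attention. Below is a minimal sketch; the function name spas_sage2_attn_meansim_cuda and the is_causal keyword follow the repository's examples but should be treated as assumptions, so check the source for the exact import path, signature, and defaults.

import torch
from spas_sage_attn import spas_sage2_attn_meansim_cuda  # assumed import path

# q, k, v laid out as (batch, heads, seq_len, head_dim), fp16 on GPU.
b, h, s, d = 2, 16, 4096, 64
q = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
k = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)
v = torch.randn(b, h, s, d, device="cuda", dtype=torch.float16)

# Used like torch.nn.functional.scaled_dot_product_attention.
out = spas_sage2_attn_meansim_cuda(q, k, v, is_causal=False)
print(out.shape)  # (b, h, s, d)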

Usage Examples

CogVideoX

Tuning:

python evaluate/cogvideo_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt --tune

Inference:

python evaluate/cogvideo_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt

Note: We provide pre-tuned hyper-parameters (CogVideoX-2b_0.06_0.07.pt) so you can run the inference script directly. However, for better speed and quality we recommend re-tuning, because the provided hyper-parameters were tuned with SpargeAttn based on SageAttention, whereas the default API is now based on SageAttention2.
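
Conceptually, --tune runs a one-off search for per-layer sparsity hyper-parameters and writes them to --model_out_path; a later run without --tune loads that file instead of re-tuning. The sketch below only illustrates this save/load flow; the dictionary layout and helper names are assumptions, not the repository's actual format.

import os
import torch

TUNED_PATH = "evaluate/models_dict/CogVideoX-2b_0.06_0.07.pt"

def tune_and_save(path: str) -> None:
    # Stand-in for the real tuning loop in evaluate/cogvideo_example.py:
    # search per-layer sparsity thresholds, then persist them.
    hyper_params = {"layer_0": {"simthreshd1": 0.06, "cdfthreshd": 0.07}}  # assumed layout
    torch.save(hyper_params, path)

def load_for_inference(path: str) -> dict:
    # Inference runs reload the tuned hyper-parameters instead of re-tuning.
    return torch.load(path, map_location="cpu")

if not os.path.exists(TUNED_PATH):
    tune_and_save(TUNED_PATH)
hyper_params = load_for_inference(TUNED_PATH)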

Llama

The tuning and inference usage is similar to CogVideoX.
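
For example, assuming an analogous script evaluate/llama_example.py exists (the script name and output path here are assumptions), the two steps would look like:

python evaluate/llama_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/llama_tuned.pt --tune

python evaluate/llama_example.py  --use_spas_sage_attn --model_out_path evaluate/models_dict/llama_tuned.pt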

Performance


Note: All experiments in the paper used SpargeAttn based on SageAttention. An updated implementation based on SageAttention2 is now available and offers a further 30% speedup.

Citation

If you use this code or find our work valuable, please cite:

@misc{zhang2025spargeattn,
      title={SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference}, 
      author={Jintao Zhang and Chendong Xiang and Haofeng Huang and Jia Wei and Haocheng Xi and Jun Zhu and Jianfei Chen},
      year={2025},
      eprint={2502.18137},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.18137}, 
}

@inproceedings{zhang2025sageattention,
      title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, 
      author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
      booktitle={International Conference on Learning Representations (ICLR)},
      year={2025}
}

@misc{zhang2024sageattention2,
      title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization}, 
      author={Jintao Zhang and Haofeng Huang and Pengle Zhang and Jia Wei and Jun Zhu and Jianfei Chen},
      year={2024},
      eprint={2411.10958},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2411.10958}, 
}
