AngelSlim is a comprehensive and efficient large model compression toolkit designed to be user-friendly. It integrates mainstream compression algorithms for easy one-click access, continuously innovates compression algorithms, and optimizes end-to-end performance in model compression and deployment. It supports various models for quantization and speculative sampling, with a focus on performance optimization and ease of use.


简体中文 | English

AngelSlim

Dedicated to building a more user-friendly, comprehensive, and efficient compression toolkit for large models

📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat (微信) | 🫨 Discord


📣 Latest Updates

  • [25/09/01] Added FP8 quantization for the open-source Hunyuan-MT-7B translation model; added Torch inference and a benchmark evaluation pipeline for Eagle3; added quantization and cache support for FLUX; added quantization for the Seed-OSS model.
  • [25/08/06] Added FP8 and INT4 quantization for Hunyuan 0.5B/1.8B/4B/7B and Qwen2.5VL 3B/7B/32B/72B, and FP8-Static and W4A8-FP8 quantization for DeepSeek-R1/V3 and Kimi-K2. Also released Eagle3 weights for the Hunyuan 1.8B/4B/7B series.
  • [25/07/04] Added quantization for Hunyuan, Qwen2.5, Qwen3, and DeepSeek-R1-Distill-Qwen models, covering INT8, FP8, and INT4 algorithms. Also released Eagle3 weights for the Qwen3 series.

Coming soon:

  • [ ] Diffusion model compression support
  • [ ] Release of new speculative-sampling algorithms

🌟 Key Features

  • Highly integrated: mainstream compression algorithms are built into the toolkit and can be invoked with a single command, making it easy to use.
  • Continuous algorithm innovation: beyond the most widely used industrial algorithms, we keep developing better in-house compression algorithms and will open-source them progressively.
  • Extreme performance: the toolkit is optimized end to end, across both the compression pipeline and the deployment of compressed models; for example, Qwen3-235B and DeepSeek-R1 can be quantized on a single GPU.

💼 Supported Models

Quantization

The main text-generation models from the following series are currently supported, with quantization formats including FP8-Dynamic, FP8-Static, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ:

  • Hunyuan-Dense
  • Hunyuan-MoE
  • Qwen3-Dense
  • Qwen3-MoE
  • Qwen2.5
  • DeepSeek-R1-Distill-Qwen
  • QwQ

Speculative Sampling

Eagle3

Eagle3 weights for the Qwen3 and Hunyuan model series have been open-sourced:

| Qwen3 Models | Hunyuan Models |
|---|---|
| Qwen3-1.7B | Hunyuan-1.8B-Instruct |
| Qwen3-4B | Hunyuan-4B-Instruct |
| Qwen3-8B | Hunyuan-7B-Instruct |
| Qwen3-14B | |
| Qwen3-32B | |
| Qwen3-30B-A3B | |

🛎️ How to Use

Install AngelSlim

We recommend installing the latest stable release of AngelSlim directly with pip:

pip install angelslim

Alternatively, clone the repository and install from source:

git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install

See the installation documentation for more detailed instructions.

Quick Start

Quantization

Once AngelSlim is installed, you can quickly run static FP8 quantization of the Qwen3-1.7B model:

  • One-click launch

    python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml

    This example loads the Hugging Face model, calibrates activations with the dataset specified in the config, and saves the quantized model weights.

  • Launch from the Python API

    Dynamic FP8 quantization of Qwen3-1.7B:

    from angelslim.engine import Engine
    
    slim_engine = Engine()
    # Prepare model
    slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
    # Initialize compressor
    slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
    # Compress model
    slim_engine.run()
    # Save compressed model
    slim_engine.save("./output")
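
The quantized checkpoint saved to ./output can then be used as the MODEL_PATH argument of the deployment and evaluation scripts described below.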

See the quick-start documentation for details.

Speculative Sampling

Once AngelSlim is installed, you can run an Eagle3 PyTorch performance benchmark with the following script:

python3 tools/spec_benchmark.py \
    --base-model-path /path/to/base/model \
    --eagle-model-path /path/to/eagle/model \
    --model-id your_model_id \
    --mode both
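
The benchmark reports two metrics that also appear in the tables below: the wall-clock speedup over the base model, and the mean acceptance length τ (average tokens accepted per draft-verify cycle). As a rough sketch of how such numbers are derived from raw measurements (illustrative only, not the exact logic of tools/spec_benchmark.py):

import statistics

def speculative_summary(base_times, spec_times, accepted_tokens, draft_steps):
    """Summarize one speculative-decoding benchmark subset.

    base_times / spec_times: per-request generation times in seconds
    accepted_tokens: total tokens produced by the speculative run
    draft_steps: total draft-then-verify cycles in that run
    """
    speedup = statistics.mean(base_times) / statistics.mean(spec_times)
    tau = accepted_tokens / draft_steps  # mean accepted tokens per cycle
    return speedup, tau

# Hypothetical measurements for one dataset
print(speculative_summary([4.1, 3.9, 4.0], [1.6, 1.5, 1.55], 5800, 2000))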

See the quick-start documentation for details.

Deployment and Testing

1. Offline Inference

To load a quantized model with transformers, set deploy_backend: huggingface in the global section of the quantization config, or manually rename the key ignored_layers to ignore in the config.json under the quantized model's output directory.
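
For the manual route, a minimal sketch of the config.json rename (the key's exact nesting may vary by checkpoint, so this checks both the top level and quantization_config; the path is a placeholder):

import json
from pathlib import Path

model_dir = Path("/path/to/quantized/model")  # placeholder path
cfg_file = model_dir / "config.json"
cfg = json.loads(cfg_file.read_text())

# The key may sit at the top level or under "quantization_config" (an
# assumption); rename it wherever it appears.
for section in (cfg, cfg.get("quantization_config", {})):
    if "ignored_layers" in section:
        section["ignore"] = section.pop("ignored_layers")

cfg_file.write_text(json.dumps(cfg, indent=2))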

To test offline inference with a quantized model loaded via transformers:

python deploy/offline.py $MODEL_PATH

where MODEL_PATH is the path to the quantized model output.
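
Conceptually, loading the quantized checkpoint with plain transformers looks roughly like the following sketch (not the exact contents of deploy/offline.py):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./output"  # quantized checkpoint produced by AngelSlim
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

inputs = tokenizer("Give me a short introduction to large language models.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))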

2. Serving

OpenAI-compatible API services can be deployed with the following inference frameworks:

vLLM

Launch script for a vLLM service. vllm>=0.8.5.post1 is recommended; deploying MoE INT8 quantized models requires vllm>=0.9.2:

bash deploy/run_vllm.sh $MODEL_PATH
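
Before serving, one quick sanity check is vLLM's offline Python API, which detects the quantization format from the checkpoint config; the path below is a placeholder:

from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/quantized/model")  # placeholder path
params = SamplingParams(temperature=0.7, top_p=0.6, max_tokens=128)
outputs = llm.generate(["Give me a short introduction to large language models."],
                       params)
print(outputs[0].outputs[0].text)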

SGLang

Launch script for an SGLang service. sglang>=0.4.6.post1 is recommended:

bash deploy/run_sglang.sh $MODEL_PATH

3. Calling the Service

Send requests through the OpenAI-compatible interface:

bash deploy/openai.sh $MODEL_PATH
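
Equivalently, from Python with the openai client; the base URL, API key, and model name below are placeholders that depend on how the server was launched:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="/path/to/quantized/model",  # the name the server was started with
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    top_p=0.6,
)
print(response.choices[0].message.content)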

4. Accuracy Verification

Evaluate quantized-model accuracy with lm-evaluation-harness. lm-eval>=0.4.8 is recommended:

bash deploy/lm_eval.sh $MODEL_PATH
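
deploy/lm_eval.sh wraps the harness; for reference, lm-eval also exposes a Python entry point. A minimal sketch (the task choice and model path are placeholders):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/quantized/model,dtype=auto",
    tasks=["gsm8k"],
    batch_size=8,
)
print(results["results"])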

See the deployment documentation for a detailed guide.

📈 Benchmark

(1) Quantization

Only a subset of model results is shown below; see the Benchmark documentation for the full benchmark.

Hunyuan Series Models

Evaluation results for Hunyuan-Instruct models with BF16, FP8-Static, INT4-GPTQ, and INT4-AWQ on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:

| Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
|---|---|---|---|---|---|
| Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
| | FP8-Static | 83.0 | 86.7 | 91.1 | - |
| | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
| | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
| Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
| | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
| | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
| | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
| Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
| | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
| | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
| | Int4-AWQ | 72.8 | - | 78.2 | - |
| Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
| | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
| | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
| | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
| Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
| | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
| | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
| | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |

Qwen3 Series Models

Evaluation results for Qwen3 series models with BF16, FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, GSM8K, and HUMANEVAL:

| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|---|---|---|---|---|---|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |

Qwen2.5VL Series Models

Evaluation results for Qwen2.5VL series models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:

| Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
|---|---|---|---|---|
| Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
| | FP8-Static | 47.33 | 79.34 | 79.68 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
| | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
| | INT4-AWQ | 45.78 | - | 79.60 |
| Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
| | FP8-Static | 47.00 | 89.83 | 85.92 |
| | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
| | INT4-GPTQ | 46.67 | 90.45 | - |
| | INT4-AWQ | 45.67 | 89.28 | - |
| Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
| | FP8-Static | 57.00 | 89.88 | - |
| | FP8-Dynamic | 56.44 | 89.88 | - |
| | INT4-GPTQ | 55.22 | 89.80 | - |
| | INT4-AWQ | 55.22 | 90.30 | - |
| Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
| | FP8-Static | 57.89 | 94.41 | 85.84 |
| | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
| | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
| | INT4-AWQ | 58.78 | 94.19 | 87.28 |

DeepSeek Series Models

Evaluation results for the DeepSeek-R1-0528 model with FP8-Block-Wise and W4A8-FP8 on GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:

| Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
|---|---|---|---|---|---|
| DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
| | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |

Notes

  • The results above were obtained by deploying with the TRT-LLM framework and averaging over 5 runs.
  • The evaluation used the following hyperparameters:
{
 "top_k": 20,
 "top_p": 0.6,
 "temperature": 0.7,
 "output_seq_len": 32768,
 "max_input_seq_len": 16384
}

Other Models

Evaluation results for other models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ on CEVAL, MMLU, and GSM8K:

| Model | Quantization | CEVAL | MMLU | GSM8K |
|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |

(2) Speculative Sampling

Qwen3 Series Models

Speedup results for Qwen3-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca (each cell reports Speedup / τ, where τ is the mean acceptance length):

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|---|---|---|---|---|---|---|
| T=0 | Qwen3-1.7B | 2.05x / 2.81 | 2.07x / 2.93 | 2.11x / 2.98 | 1.93x / 2.69 | 2.04x / 2.85 |
| | Qwen3-4B | 2.21x / 3.01 | 2.36x / 3.24 | 2.42x / 3.13 | 2.32x / 2.75 | 2.33x / 3.03 |
| | Qwen3-8B | 2.63x / 3.65 | 2.76x / 3.85 | 2.82x / 3.90 | 2.62x / 3.48 | 2.70x / 3.72 |
| | Qwen3-14B | 2.23x / 3.30 | 2.53x / 3.74 | 2.56x / 3.79 | 2.16x / 3.13 | 2.37x / 3.49 |
| | Qwen3-32B | 2.39x / 2.78 | 2.37x / 2.81 | 2.47x / 2.92 | 2.42x / 2.53 | 2.41x / 2.76 |
| | Qwen3-30B-A3B | 2.84x / 3.63 | 2.27x / 3.09 | 2.64x / 3.42 | 2.83x / 3.56 | 2.64x / 3.42 |
| T=1 | Qwen3-1.7B | 1.74x / 2.53 | 1.86x / 2.70 | 1.82x / 2.69 | 1.72x / 2.46 | 1.93x / 2.60 |
| | Qwen3-4B | 1.93x / 2.60 | 2.00x / 2.84 | 2.11x / 2.82 | 2.34x / 2.50 | 1.75x / 2.69 |
| | Qwen3-8B | 1.98x / 2.75 | 2.25x / 3.11 | 2.31x / 3.15 | 2.10x / 2.76 | 2.90x / 2.94 |
| | Qwen3-14B | 1.71x / 2.61 | 1.95x / 2.87 | 2.04x / 3.08 | 1.68x / 2.55 | 2.90x / 2.78 |
| | Qwen3-32B | 1.62x / 1.91 | 1.71x / 2.05 | 1.78x / 2.10 | 1.80x / 1.95 | 1.62x / 2.00 |
| | Qwen3-30B-A3B | 1.91x / 2.46 | 2.00x / 2.64 | 1.90x / 2.53 | 1.80x / 2.32 | 1.90x / 2.48 |

Hunyuan Series Models

Speedup results for Hunyuan-series Eagle3 models on MT-bench, HumanEval, GSM8K, and Alpaca (each cell reports Speedup / τ):

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|---|---|---|---|---|---|---|
| T=0 | Hunyuan-1.8B-Instruct | 1.97x / 2.90 | 2.58x / 3.73 | 2.61x / 3.71 | 1.71x / 2.43 | 2.22x / 3.19 |
| | Hunyuan-4B-Instruct | 1.77x / 2.60 | 2.64x / 3.35 | 2.14x / 3.17 | 1.72x / 2.57 | 2.07x / 2.92 |
| | Hunyuan-7B-Instruct | 2.22x / 3.58 | 3.59x / 5.47 | 2.96x / 4.68 | 1.64x / 2.56 | 2.60x / 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x / 2.36 | 2.35x / 3.56 | 2.23x / 3.38 | 1.26x / 1.87 | 1.86x / 2.79 |
| | Hunyuan-4B-Instruct | 1.36x / 2.05 | 1.97x / 2.86 | 1.72x / 2.68 | 1.14x / 1.76 | 1.55x / 2.34 |
| | Hunyuan-7B-Instruct | 1.90x / 3.11 | 3.12x / 5.09 | 2.74x / 4.34 | 1.47x / 2.39 | 2.31x / 3.73 |

📝 License

The code in this repository is open-sourced under the License for AngelSlim.

🔗 Citation

@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={7},
    url={https://github.com/Tencent/AngelSlim},
}

💬 Community

  • AngelSlim is iterating rapidly, with more features on the way. If you have questions or suggestions, please file an issue via GitHub Issues, or join our WeChat technical discussion group.
