
AngelSlim
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.

AngelSlim is a comprehensive and efficient large model compression toolkit designed to be user-friendly. It integrates mainstream compression algorithms for easy one-click access, continuously innovates compression algorithms, and optimizes end-to-end performance in model compression and deployment. It supports various models for quantization and speculative sampling, with a focus on performance optimization and ease of use.
README:
📖 Documentation   |   🤗 Hugging Face   |   🤖 ModelScope   |   💬 WeChat (微信) |   🫨 Discord
- [25/09/01] We added FP8 quantization for the Hunyuan-MT-7B open-source translation model, Torch inference and a benchmark evaluation pipeline for Eagle3, quantization and cache support for FLUX, and quantization for the Seed-OSS model.
- [25/08/06] We added FP8 and INT4 quantization for Hunyuan 0.5B/1.8B/4B/7B and Qwen2.5VL 3B/7B/32B/72B, plus FP8-Static and W4A8-FP8 quantization for the DeepSeek-R1/V3 and Kimi-K2 models. We also open-sourced Eagle3 weights for the Hunyuan 1.8B/4B/7B series.
- [25/07/04] We added quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen models, covering INT8, FP8, and INT4 algorithms. We also open-sourced Eagle3 weights for the Qwen3 series.
Coming soon:
- [ ] Diffusion model compression support
- [ ] New speculative-sampling algorithm release
- Highly integrated: mainstream compression algorithms are built into the toolkit and can be invoked with a single command, making it very easy to use.
- Continuous algorithm innovation: beyond the algorithms most widely used in industry, the toolkit keeps developing better in-house compression algorithms, which will be open-sourced progressively.
- Pursuit of peak performance: the compression pipeline and the deployment of compressed models are continuously optimized end to end; for example, Qwen3-235B and DeepSeek-R1 can be quantized on a single GPU.
Supported text-generation models currently include the main models of the Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen, and QwQ series:
Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
---|---|---|---|---|---|
Hunyuan-Dense | ✅ | ✅ | ✅ | ✅ | ✅ |
Hunyuan-MoE | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen3-Dense | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen3-MoE | ✅ | ✅ | ✅ | ✅ | ✅ |
Qwen2.5 | ✅ | ✅ | ✅ | ✅ | ✅ |
DeepSeek-R1-Distill-Qwen | ✅ | ✅ | ✅ | ✅ | ✅ |
QwQ | ✅ | ✅ | ✅ | ✅ | ✅ |
Eagle3 weights have been open-sourced for the Qwen3 and Hunyuan model series:
Qwen3 Models | Hunyuan Models |
---|---|
✅ Qwen3-1.7B | ✅ Hunyuan-1.8B-Instruct |
✅ Qwen3-4B | ✅ Hunyuan-4B-Instruct |
✅ Qwen3-8B | ✅ Hunyuan-7B-Instruct |
✅ Qwen3-14B | |
✅ Qwen3-32B | |
✅ Qwen3-30B-A3B | |
We recommend installing the latest stable release of AngelSlim directly with pip:

```bash
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```bash
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```

See the installation documentation for more detailed instructions.
After installing AngelSlim, you can get started quickly with the following steps, which perform static FP8 quantization of the Qwen3-1.7B model:

- One-click launch

  ```bash
  python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
  ```

  This loads the HuggingFace model, calibrates activations with the dataset specified in the config, and produces quantized model weights.

- Launch from source

  Perform dynamic FP8 quantization of Qwen3-1.7B:

  ```python
  from angelslim.engine import Engine

  slim_engine = Engine()
  # Prepare model
  slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
  # Initialize compressor
  slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
  # Compress model
  slim_engine.run()
  # Save compressed model
  slim_engine.save("./output")
  ```
See the Quick Start documentation for details.
After installing AngelSlim, you can run the following script to benchmark Eagle3 performance with PyTorch:
```bash
python3 tools/spec_benchmark.py \
    --base-model-path /path/to/base/model \
    --eagle-model-path /path/to/eagle/model \
    --model-id your_model_id \
    --mode both
```
See the Quick Start documentation for details.
To load a quantized model with transformers, either set deploy_backend: huggingface in the global section of the quantization config, or manually rename the key ignored_layers to ignore in the config.json under the quantized model's output path.
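If you take the manual route, the rename can be scripted. The sketch below is illustrative (the helper script name is hypothetical, not part of the repository); it simply renames the ignored_layers key wherever it appears in config.json:

```python
import json
import sys

def rename_key(obj, old="ignored_layers", new="ignore"):
    """Recursively rename a key wherever it appears in nested JSON data."""
    if isinstance(obj, dict):
        return {(new if k == old else k): rename_key(v, old, new)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [rename_key(v, old, new) for v in obj]
    return obj

# Usage: python fix_config.py $MODEL_PATH  (hypothetical helper script)
path = f"{sys.argv[1]}/config.json"
with open(path) as f:
    cfg = json.load(f)
with open(path, "w") as f:
    json.dump(rename_key(cfg), f, indent=2, ensure_ascii=False)
```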
To test offline inference with transformers loading the quantized model:

```bash
python deploy/offline.py $MODEL_PATH
```

where MODEL_PATH is the path to the quantized model output.
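Conceptually, this amounts to loading the model through the standard transformers API. The following is a minimal sketch under that assumption; the prompt and generation settings are placeholders, and the actual deploy/offline.py may differ:

```python
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = sys.argv[1]  # quantized model output path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

# Placeholder prompt and generation settings
inputs = tokenizer("What is FP8 quantization?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```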
OpenAI-compatible API services can be deployed with the following inference frameworks:

vLLM

Launch the vLLM service with the script below; vllm>=0.8.5.post1 is recommended, and deploying MoE INT8 quantized models requires vllm>=0.9.2.

```bash
bash deploy/run_vllm.sh $MODEL_PATH
```

SGLang

Launch the SGLang service with the script below; sglang>=0.4.6.post1 is recommended:

```bash
bash deploy/run_sglang.sh $MODEL_PATH
```

Send a request through the OpenAI-format API:

```bash
bash deploy/openai.sh $MODEL_PATH
```
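A request can also be issued directly with the OpenAI Python client. This is a minimal sketch: the host and port (vLLM serves on 8000 by default) and the served model name are assumptions that depend on how the launch script configured the server:

```python
from openai import OpenAI

# Assumed endpoint; vLLM serves on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-1.7B",  # served model name, assumed
    messages=[{"role": "user", "content": "Briefly explain model quantization."}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```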
Evaluate quantized-model accuracy with lm-evaluation-harness; lm-eval>=0.4.8 is recommended:

```bash
bash deploy/lm_eval.sh $MODEL_PATH
```
See the deployment documentation for detailed instructions.
Only a subset of model results is shown below; see the Benchmark documentation for the complete benchmark.
Results for Hunyuan-Instruct models with BF16, FP8, INT4-GPTQ, and INT4-AWQ quantization on OlympiadBench, AIME 2024, DROP, and GPQA-Diamond:
Model | Quantization | OlympiadBench | AIME 2024 | DROP | GPQA-Diamond |
---|---|---|---|---|---|
Hunyuan-A13B-Instruct | BF16 | 82.7 | 87.30 | 91.1 | 71.2 |
 | FP8-Static | 83.0 | 86.7 | 91.1 | - |
 | Int4-GPTQ | 82.7 | 86.7 | 91.1 | - |
 | Int4-AWQ | 82.6 | 85.6 | 91.0 | - |
Hunyuan-7B-Instruct | BF16 | 76.5 | 81.1 | 85.9 | 60.1 |
 | FP8-Static | 76.6 | 80.9 | 86.0 | 60.1 |
 | Int4-GPTQ | 76.2 | 81.0 | 85.7 | 60.0 |
 | Int4-AWQ | 76.4 | 80.9 | 85.9 | 60.1 |
Hunyuan-4B-Instruct | BF16 | 73.1 | 78.3 | 78.2 | 61.1 |
 | FP8-Static | 73.1 | 76.6 | 78.3 | 60.2 |
 | Int4-GPTQ | 72.9 | - | 78.1 | 58.1 |
 | Int4-AWQ | 72.8 | - | 78.2 | - |
Hunyuan-1.8B-Instruct | BF16 | 63.4 | 56.7 | 76.7 | 47.2 |
 | FP8-Static | 62.5 | 55.2 | 75.1 | 47.7 |
 | Int4-GPTQ | 60.9 | - | 73.0 | 44.4 |
 | Int4-AWQ | 61.7 | - | 71.7 | 43.6 |
Hunyuan-0.5B-Instruct | BF16 | 29.6 | 17.2 | 52.8 | 23.3 |
 | FP8-Static | 29.6 | 17.2 | 51.6 | 22.5 |
 | Int4-GPTQ | 26.8 | - | 50.9 | 23.3 |
 | Int4-AWQ | 26.3 | - | 48.9 | 23.3 |
Results for the Qwen3 series with BF16, FP8-Static, FP8-Dynamic, INT8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization on CEVAL, MMLU, GSM8K, and HumanEval:
Model | Quantization | CEVAL | MMLU | GSM8K | HumanEval |
---|---|---|---|---|---|
Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
 | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
 | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
 | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
 | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
 | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
 | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
 | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
 | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
 | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
 | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
 | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
 | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
 | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
 | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
 | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
 | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
 | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
 | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
 | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
 | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
 | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
 | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
 | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
 | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
 | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
 | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
 | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
Results for the Qwen2.5VL series with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization on MMMU_VAL, DocVQA_VAL, and ChartQA_TEST:
Model | Quantization | MMMU_VAL | DocVQA_VAL | ChartQA_TEST |
---|---|---|---|---|
Qwen2.5VL-3B | BF16 | 47.11 | 78.57 | 80.32 |
 | FP8-Static | 47.33 | 79.34 | 79.68 |
 | FP8-Dynamic | 45.99 | 46.93 | 38.29 |
 | INT4-GPTQ | 46.56 | 77.20 | 78.96 |
 | INT4-AWQ | 45.78 | - | 79.60 |
Qwen2.5VL-7B | BF16 | 45.44 | 89.71 | 84.64 |
 | FP8-Static | 47.00 | 89.83 | 85.92 |
 | FP8-Dynamic | 47.22 | 89.80 | 88.64 |
 | INT4-GPTQ | 46.67 | 90.45 | - |
 | INT4-AWQ | 45.67 | 89.28 | - |
Qwen2.5VL-32B | BF16 | 57.00 | 90.03 | - |
 | FP8-Static | 57.00 | 89.88 | - |
 | FP8-Dynamic | 56.44 | 89.88 | - |
 | INT4-GPTQ | 55.22 | 89.80 | - |
 | INT4-AWQ | 55.22 | 90.30 | - |
Qwen2.5VL-72B | BF16 | 58.78 | 94.39 | 85.60 |
 | FP8-Static | 57.89 | 94.41 | 85.84 |
 | FP8-Dynamic | 58.67 | 94.38 | 85.60 |
 | INT4-GPTQ | 57.56 | 94.46 | 86.48 |
 | INT4-AWQ | 58.78 | 94.19 | 87.28 |
Results for DeepSeek-R1-0528 with FP8-Block-Wise and W4A8-FP8 quantization on GPQA Diamond, AIME 2024, SimpleQA, and LiveCodeBench:
Model | Quantization | GPQA Diamond | AIME 2024 | SimpleQA | LiveCodeBench |
---|---|---|---|---|---|
DeepSeek-R1-0528 | FP8-Block-Wise | 78.28 | 88.67 | 27.8 | 77.1 |
 | W4A8-FP8 | 77.37 | 88.67 | 26.83 | 78.86 |
Notes:
- The results above were measured with the TRT-LLM deployment framework, averaged over 5 runs.
- The evaluation hyperparameters were:

```json
{
  "top_k": 20,
  "top_p": 0.6,
  "temperature": 0.7,
  "output_seq_len": 32768,
  "max_input_seq_len": 16384
}
```
Results for other models with BF16, FP8-Static, FP8-Dynamic, INT4-GPTQ, and INT4-AWQ quantization on CEVAL, MMLU, and GSM8K:
Model | Quantization | CEVAL | MMLU | GSM8K |
---|---|---|---|---|
Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
 | FP8-Static | 66.27 | 60.23 | - |
 | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
 | FP8-Static | 81.13 | 74.03 | 79.30 |
 | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
 | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
 | INT4-AWQ | 79.35 | 73.22 | 79.38 |
Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
 | FP8-Static | 87.59 | 83.08 | 81.58 |
 | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
 | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
 | INT4-AWQ | 87.00 | 82.64 | - |
DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
 | FP8-Static | 53.57 | 54.17 | 76.19 |
 | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
 | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
 | INT4-AWQ | 53.49 | 53.70 | - |
DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
 | FP8-Static | 77.56 | 74.66 | 86.73 |
 | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
 | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
 | INT4-AWQ | 74.81 | 73.00 | 86.05 |
DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
 | FP8-Static | 83.43 | 80.90 | 87.57 |
 | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
 | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
 | INT4-AWQ | 82.84 | 80.15 | 87.19 |
Speedup results for the Qwen3-series Eagle3 models on MT-bench/HumanEval/GSM8K/Alpaca:
Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
---|---|---|---|---|---|---|---|---|---|---|---|
T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
 | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
 | Qwen3-8B | 2.63x | 3.65 | 2.76x | 3.85 | 2.82x | 3.90 | 2.62x | 3.48 | 2.70x | 3.72 |
 | Qwen3-14B | 2.23x | 3.30 | 2.53x | 3.74 | 2.56x | 3.79 | 2.16x | 3.13 | 2.37x | 3.49 |
 | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
 | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
 | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
 | Qwen3-8B | 1.98x | 2.75 | 2.25x | 3.11 | 2.31x | 3.15 | 2.10x | 2.76 | 2.90x | 2.94 |
 | Qwen3-14B | 1.71x | 2.61 | 1.95x | 2.87 | 2.04x | 3.08 | 1.68x | 2.55 | 2.90x | 2.78 |
 | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
 | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
Speedup results for the Hunyuan-series Eagle3 models on MT-bench/HumanEval/GSM8K/Alpaca:
Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
---|---|---|---|---|---|---|---|---|---|---|---|
T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
 | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
 | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
 | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
 | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
The code in this project is open-sourced under the License for AngelSlim.
```bibtex
@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={7},
    url={https://github.com/Tencent/AngelSlim},
}
```
- AngelSlim is iterating rapidly, and more features are on the way. If you run into problems or have suggestions, please file an issue through GitHub Issues or join our WeChat technical discussion group.