
GPTQModel
Production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
Stars: 371

GPTQModel is an easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm. It provides weight-only quantization and offers features such as dynamic per-layer/module quantization control, sharding support, and auto-healing of quantization errors. The toolkit ensures inference compatibility with HF Transformers, vLLM, and SGLang. It offers broad model support, faster quantized inference, better-quality quants, and security features such as hash checking of model weights. GPTQModel also focuses on faster quantization, improved quant quality as measured by PPL, and backports bug fixes from AutoGPTQ.
README:
Production-ready LLM compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
- 03/13/2025 2.1.1-dev: Auto `bfloat16` dtype loading for models based on model config. Fix kernel compile for PyTorch/ROCm.
- 03/12/2025 2.1.0: New `QQQ` quantization method and inference support! New Google `Gemma 3` zero-day model support. New Alibaba `Ovis 2` VL model support. New AMD `Instella` zero-day model support. New `GSM8K Platinum` and `MMLU-Pro` benchmarking support. Peft Lora training with GPTQModel is now 30%+ faster on all gpu and IPEX devices. Auto-detect MoE modules not activated during quantization due to insufficient calibration data. `ROCm` `setup.py` compat fixes. `Optimum` and `Peft` compat fixes. Fixed `Peft` `bfloat16` training.
- 03/03/2025 2.0.0: `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, a lower-quality but faster backend. `ModelScope` support added. Logging and cli progress bar output has been revamped with sticky bottom progress. Fixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate logging/progressbar to LogBar pkg. Fix ROCm version auto detection in `setup` install.
- 02/12/2025 1.9.0: Offload `tokenizer` fixes to Toke(n)icer pkg. Optimized `lm_head` quant time and vram usage. Optimized `DeepSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regression in `v1.8.1`. 3x speed-up for `Torch` kernel when using PyTorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like `wikitext2`.
Archived News
* 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes. `Triton` and `Torch` kernels support the full range of new `QuantizeConfig.pack_dtype`. New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of oom. New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models. Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
* 01/26/2025 1.7.4: New `compile()` api for ~4-8% inference tps improvement. Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
* 01/20/2025 1.7.3: New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied-embedding.
* 01/19/2025 1.7.2: Effective BPW (bits per weight) will now be logged during `load()`. Reduce loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version.
* 01/17/2025 1.7.0: `backend.MLX` added for runtime-conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` also now possible. We have added `mlx` exported models to huggingface.co/ModelCloud. `lm_head` quantization now fully supported by GPTQModel without external pkg dependency.
* 01/07/2025 1.6.1: New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference. Fixed `sym=False` loading regression.
* 01/06/2025 1.6.0: 25% faster quantization. 35% reduction in vram usage vs v1.5. AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via `load()` api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
* 01/01/2025 1.5.1: 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on cpu by default and each layer is moved to `QuantizeConfig.device` during quantization to minimize vram usage. Compatibility fixes for `attn_implementation_autoset` in latest transformers.
* 12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support added for Qwen2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
* 12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
* 12/15/2024 1.4.2: MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
* 12/13/2024 1.4.1: Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are in pending status.
* 12/10/2024 1.4.0: `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be the `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactor and cleanup in preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime.
* 11/29/2024 1.3.1: Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton pkg.
* 11/26/2024 1.3.0: Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependency.
* 11/24/2024 1.2.3: HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` depends. Fixed model unit tests.
* 11/11/2024 1.2.1: Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.
* 10/29/2024 1.1.0: IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
* 10/12/2024 1.0.9: Move AutoRound to optional and fix pip install regression in v1.0.8.
* 10/11/2024 1.0.8: Add wheel for python 3.12 and cuda 11.8.
* 10/08/2024 1.0.7: Fixed marlin (faster) kernel was not auto-selected for some models.
* 09/26/2024 1.0.6: Fixed quantized Llama 3.2 vision loader.
* 09/26/2024 1.0.5: Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
* 09/26/2024 1.0.4: Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added toggle to disable parallel packing.
* 09/18/2024 1.0.3: Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 1.0.2: Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
* 08/14/2024 1.0.0: 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api.
* 08/10/2024 0.9.11: Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
* 07/31/2024 0.9.10: Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers depend updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
* 07/25/2024 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
* 07/13/2024 0.9.8: Run quantized models directly using GPTQModel with the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
* 07/08/2024 0.9.7: InternLM 2.5 model support added.
* 07/08/2024 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
* 07/05/2024 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama (v1/v2)/Marlin/Triton.
* 07/03/2024 0.9.4: HF Transformers integration added and bug-fixed Gemma 2 support.
* 07/02/2024 0.9.3: Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactor.
* 06/30/2024 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
* 06/29/2024 0.9.1: With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of calibration dataset resulting in >50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes and much more.
* 06/20/2024 0.9.0: Thanks for all the work from the ModelCloud team and the open-source ML community for their contributions!
GPTQModel is a production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU and GPU via HF Transformers, vLLM, and SGLang.
Public and ModelCloud's internal tests have shown that GPTQ is on par with, or exceeds, other 4-bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ offers the blend of quality and inference speed needed in a real-world production deployment.
GPTQModel not only supports GPTQ but also QQQ, with support for more quantization methods planned.
GPTQModel has an expandable/modular design that supports multiple quantization methods:
Quantization | GPTQModel | Transformers | vLLM | SGLang | Lora Training |
---|---|---|---|---|---|
GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
QQQ + Rotation | ✅ | x | ✅ | ✅ | x |
GPTQModel natively supports some of the most popular multi-modal models:
Multi-Modal | |
---|---|
Qwen2 VL | ✅ |
Ovis 1.6 + 2 | ✅ |
- Native integration with HF Transformers, Optimum, and Peft (main)
- vLLM and SGLang inference integration for quantized models with format = `FORMAT.GPTQ`
- Extensive model support for: `Ovis VL`, `Llama 1-3.3`, `Qwen2-VL`, `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-MoE`, `Phi 1-4`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Qwen2MoE`, `DBRX`.
- Linux, MacOS, and Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), and CPU (Intel/AMD/Apple Silicon).
- 100% CI unit-test coverage for all supported models and kernels, including post-quantization quality regression.
- `Dynamic` mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether.
- Intel/IPEX hardware-accelerated quantization/inference for CPU [`avx`, `amx`, `xmx`] and Intel GPU [`Arc` + `Datacenter Max`].
- Microsoft/BITBLAS format + dynamically compiled inference.
- Intel/AutoRound alternative gptq-inference-compatible quantization method.
- Asymmetric `Sym=False` support. Model weight sharding support with optional hash check of model weights on load.
- `lm_head` module quant inference support for further VRAM reduction.
- 45% faster `packing` stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).

ModelCloud quantized Vortex models on HF
Model | | Model | | Model | | Model | | Model | |
---|---|---|---|---|---|---|---|---|---|
Baichuan | ✅ | Falcon | ✅ | InternLM 1/2.5 | ✅ | OPT | ✅ | TeleChat2 | ✅ |
Bloom | ✅ | Gemma 1/2/3 | ✅ | Llama 1-3.3 | ✅ | OLMo2 | ✅ | Yi | ✅ |
ChatGLM | ✅ | GPTBigCode | ✅ | Llama 3.2 VL | ✅ | Ovis 1.6/2 | ✅ | XVERSE | ✅ |
CodeGen | ✅ | GPTNeoX | ✅ | LongLLaMA | ✅ | Phi 1-4 | ✅ | | |
Cohere 1-2 | ✅ | GPT-2 | ✅ | MiniCPM3 | ✅ | Qwen | ✅ | | |
DBRX Converted | ✅ | GPT-J | ✅ | Mistral | ✅ | Qwen2 MoE | ✅ | | |
Deci | ✅ | Granite | ✅ | Mixtral | ✅ | Qwen2 VL | ✅ | | |
DeepSeek-V2/V3/R1 | ✅ | GRIN-MoE | ✅ | MobileLLM | ✅ | RefinedWeb | ✅ | | |
DeepSeek-V2-Lite | ✅ | Hymba | ✅ | MOSS | ✅ | StableLM | ✅ | | |
EXAONE 3.0 | ✅ | Instella | ✅ | MPT | ✅ | StarCoder2 | ✅ | | |
GPTQModel is validated for Linux, MacOS, and Windows 11:
Platform | Device | Validated | Optimized Arch | Kernels |
---|---|---|---|---|
Linux | Nvidia GPU | ✅ | Ampere+ | Marlin, Exllama V2, Exllama V1, Triton, Torch |
Linux | Intel XPU | ✅ | `Arc`, `Datacenter Max` | IPEX, Torch |
Linux | AMD GPU | ✅ | `7900XT+`, `ROCm 6.2+` | Exllama V2, Exllama V1, Torch |
Linux | Intel/AMD CPU | ✅ | `avx`, `amx`, `xmx` | IPEX, Torch |
MacOS | GPU (Metal) / CPU | ✅ | `Apple Silicon`, `M1+` | Torch, MLX via conversion |
Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | Torch |
# You can install optional modules like autoround, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# pip: compile and install
# You can install optional modules like autoround, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation
Three-line api to use `GPTQModel` for gptq model inference:
from gptqmodel import GPTQModel
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
export GPTQMODEL_USE_MODELSCOPE=True
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")
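For a quick client-side check of the endpoint started above, any OpenAI-compatible client should work. Below is a minimal sketch using `requests`; the `/v1/chat/completions` route, payload, and model name follow the standard OpenAI schema and are assumptions here, since `model.serve()` may expose a slightly different surface:

```python
import requests

# Query the locally served model started via model.serve(host="0.0.0.0", port="12345")
# Route and payload assume the standard OpenAI chat-completions schema.
resp = requests.post(
    "http://127.0.0.1:12345/v1/chat/completions",
    json={
        "model": "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
        "messages": [{"role": "user", "content": "Uncovering deep insights begins with"}],
    },
    timeout=60,
)
print(resp.json())
```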
Basic example of using `GPTQModel` to quantize an LLM:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
GPTQModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.
# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256, as the latter may overfit while increasing memory usage
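# NOTE: the `Lora` adapter config class below ships with GPTQModel; see GPTQModel/examples/eora for the exact import to use.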
eora = Lora(
# for eora generation, path is adapter save path; for load, it is loading path
path=f"{quant_path}/eora_rank32",
rank=32,
)
# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
adapter=eora,
model_id_or_path=model_id,
quantized_model_id_or_path=quant_path,
calibration_dataset=calibration_dataset,
calibration_dataset_concat_size=0,
auto_gc=False)
# post-eora inference
model = GPTQModel.load(
model_id_or_path=quant_path,
adapter=eora
)
tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")
# For more details on EoRA, please see GPTQModel/examples/eora
# Please use the benchmark tools later in this README to evaluate EoRA effectiveness
For more advanced model quantization features, please refer to this script.
Read the `gptqmodel/models/llama.py` code, which explains in detail via comments how model support is defined. Use it as a guide when submitting PRs for new models. Most models follow the same pattern.
GPTQModel inference is integrated into both `lm-eval` and `evalplus`. We highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
Below is a basic sample using the `GPTQModel.eval` API:
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL
model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"
# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')
# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
`QuantizeConfig.dynamic` is a dynamic control which allows specific matching modules to be skipped for quantization (negative matching), or to have a unique `[bits, group_size, sym, desc_act, mse, pack_dtype]` property override per matching module vs the base `QuantizeConfig` (positive match with override).
Sample `QuantizeConfig.dynamic` usage:
dynamic = {
# `.*\.` matches the layers_node prefix
# layer index start at 0
# positive match: layer 19, gate module
r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
# positive match: layer 20, gate module (prefix defaults to positive if missing)
r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
# negative match: skip layer 21, gate module
r"-:.*\.20\..*gate.*": {},
# negative match: skip all down modules for all layers
r"-:.*down.*": {},
}
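A minimal sketch of wiring the `dynamic` dict above into a quantization run. It assumes the dict is passed to `QuantizeConfig` through its documented `dynamic` field as a constructor keyword; the regex patterns are illustrative and must match the target model's module names. The rest mirrors the basic quantization example earlier in this README:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Base 4-bit / group_size 128 config applies everywhere;
# entries in `dynamic` override or skip matching modules (assumed keyword).
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save("Llama-3.2-1B-Instruct-gptqmodel-4bit-dynamic")
```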
# GPTQModel
@misc{gptqmodel,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {GPTQModel},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
note = {Contact: qubitium@modelcloud.ai}
}
# GPTQ
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
year={2022},
journal={arXiv preprint arXiv:2210.17323}
}
# GPTQ Marlin Kernel
@article{frantar2024marlin,
title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
journal={arXiv preprint arXiv:2408.11743},
year={2024}
}
# QQQ
@article{zhang2024qqq,
title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models},
author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
journal={arXiv preprint arXiv:2406.09904},
year={2024}
}
# EoRA
@article{liu2024eora,
title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
journal={arXiv preprint arXiv:2410.21271},
year={2024}
}
Alternative AI tools for GPTQModel
Similar Open Source Tools

GPTQModel
GPTQModel is an easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm. It provides support for weight-only quantization and offers features such as dynamic per layer/module flexible quantization, sharding support, and auto-heal quantization errors. The toolkit aims to ensure inference compatibility with HF Transformers, vLLM, and SGLang. It offers various model supports, faster quant inference, better quality quants, and security features like hash check of model weights. GPTQModel also focuses on faster quantization, improved quant quality as measured by PPL, and backports bug fixes from AutoGPTQ.

MaskLLM
MaskLLM is a learnable pruning method that establishes Semi-structured Sparsity in Large Language Models (LLMs) to reduce computational overhead during inference. It is scalable and benefits from larger training datasets. The tool provides examples for running MaskLLM with Megatron-LM, preparing LLaMA checkpoints, pre-tokenizing C4 data for Megatron, generating prior masks, training MaskLLM, and evaluating the model. It also includes instructions for exporting sparse models to Huggingface.

langcheck
LangCheck is a Python library that provides a suite of metrics and tools for evaluating the quality of text generated by large language models (LLMs). It includes metrics for evaluating text fluency, sentiment, toxicity, factual consistency, and more. LangCheck also provides tools for visualizing metrics, augmenting data, and writing unit tests for LLM applications. With LangCheck, you can quickly and easily assess the quality of LLM-generated text and identify areas for improvement.

yomitoku
YomiToku is a Japanese-focused AI document image analysis engine that provides full-text OCR and layout analysis capabilities for images. It recognizes, extracts, and converts text information and figures in images. It includes 4 AI models trained on Japanese datasets for tasks such as detecting text positions, recognizing text strings, analyzing layouts, and recognizing table structures. The models are specialized for Japanese document images, supporting recognition of over 7000 Japanese characters and analyzing layout structures specific to Japanese documents. It offers features like layout analysis, table structure analysis, and reading order estimation to extract information from document images without disrupting their semantic structure. YomiToku supports various output formats such as HTML, markdown, JSON, and CSV, and can also extract figures, tables, and images from documents. It operates efficiently in GPU environments, enabling fast and effective analysis of document transcriptions without requiring high-end GPUs.

ms-swift
ms-swift is an official framework provided by the ModelScope community for fine-tuning and deploying large language models and multi-modal large models. It supports training, inference, evaluation, quantization, and deployment of over 400 large models and 100+ multi-modal large models. The framework includes various training technologies and accelerates inference, evaluation, and deployment modules. It offers a Gradio-based Web-UI interface and best practices for easy application of large models. ms-swift supports a wide range of model types, dataset types, hardware support, lightweight training methods, distributed training techniques, quantization training, RLHF training, multi-modal training, interface training, plugin and extension support, inference acceleration engines, model evaluation, and model quantization.

agentops
AgentOps is a toolkit for evaluating and developing robust and reliable AI agents. It provides benchmarks, observability, and replay analytics to help developers build better agents. AgentOps is open beta and can be signed up for here. Key features of AgentOps include: - Session replays in 3 lines of code: Initialize the AgentOps client and automatically get analytics on every LLM call. - Time travel debugging: (coming soon!) - Agent Arena: (coming soon!) - Callback handlers: AgentOps works seamlessly with applications built using Langchain and LlamaIndex.

rust-genai
genai is a multi-AI providers library for Rust that aims to provide a common and ergonomic single API to various generative AI providers such as OpenAI, Anthropic, Cohere, Ollama, and Gemini. It focuses on standardizing chat completion APIs across major AI services, prioritizing ergonomics and commonality. The library initially focuses on text chat APIs and plans to expand to support images, function calling, and more in the future versions. Version 0.1.x will have breaking changes in patches, while version 0.2.x will follow semver more strictly. genai does not provide a full representation of a given AI provider but aims to simplify the differences at a lower layer for ease of use.

PDFMathTranslate
PDFMathTranslate is a tool designed for translating scientific papers and conducting bilingual comparisons. It preserves formulas, charts, table of contents, and annotations. The tool supports multiple languages and diverse translation services. It provides a command-line tool, interactive user interface, and Docker deployment. Users can try the application through online demos. The tool offers various installation methods including command-line, portable, graphic user interface, and Docker. Advanced options allow users to customize translation settings. Additionally, the tool supports secondary development through APIs for Python and HTTP. Future plans include parsing layout with DocLayNet based models, fixing page rotation and format issues, supporting non-PDF/A files, and integrating plugins for Zotero and Obsidian.

chatglm.cpp
ChatGLM.cpp is a C++ implementation of ChatGLM-6B, ChatGLM2-6B, ChatGLM3-6B and more LLMs for real-time chatting on your MacBook. It is based on ggml, working in the same way as llama.cpp. ChatGLM.cpp features accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing. It also supports P-Tuning v2 and LoRA finetuned models, streaming generation with typewriter effect, Python binding, web demo, api servers and more possibilities.

Ling
Ling is a MoE LLM provided and open-sourced by InclusionAI. It includes two different sizes, Ling-Lite with 16.8 billion parameters and Ling-Plus with 290 billion parameters. These models show impressive performance and scalability for various tasks, from natural language processing to complex problem-solving. The open-source nature of Ling encourages collaboration and innovation within the AI community, leading to rapid advancements and improvements. Users can download the models from Hugging Face and ModelScope for different use cases. Ling also supports offline batched inference and online API services for deployment. Additionally, users can fine-tune Ling models using Llama-Factory for tasks like SFT and DPO.

polaris
Polaris establishes a novel, industry-certified standard to foster the development of impactful methods in AI-based drug discovery. This library is a Python client to interact with the Polaris Hub. It allows you to download Polaris datasets and benchmarks, evaluate a custom method against a Polaris benchmark, and create and upload new datasets and benchmarks.

star-vector
StarVector is a multimodal vision-language model for Scalable Vector Graphics (SVG) generation. It can be used to perform image2SVG and text2SVG generation. StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. It achieves state-of-the-art performance in producing compact and semantically rich SVGs. The tool provides Hugging Face model checkpoints for image2SVG vectorization, with models like StarVector-8B and StarVector-1B. It also offers datasets like SVG-Stack, SVG-Fonts, SVG-Icons, SVG-Emoji, and SVG-Diagrams for evaluation. StarVector can be trained using Deepspeed or FSDP for tasks like Image2SVG and Text2SVG generation. The tool provides a demo with options for HuggingFace generation or VLLM backend for faster generation speed.

Qwen
Qwen is a series of large language models developed by Alibaba DAMO Academy. Qwen models outperform baseline models of similar size on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

yolo-flutter-app
Ultralytics YOLO for Flutter is a Flutter plugin that allows you to integrate Ultralytics YOLO computer vision models into your mobile apps. It supports both Android and iOS platforms, providing APIs for object detection and image classification. The plugin leverages Flutter Platform Channels for seamless communication between the client and host, handling all processing natively. Before using the plugin, you need to export the required models in `.tflite` and `.mlmodel` formats. The plugin provides support for tasks like detection and classification, with specific instructions for Android and iOS platforms. It also includes features like camera preview and methods for object detection and image classification on images. Ultralytics YOLO thrives on community collaboration and offers different licensing paths for open-source and commercial use cases.

AnglE
AnglE is a library for training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. It also serves as a general sentence embedding inference framework, allowing for inferring a variety of transformer-based sentence embeddings. The library supports various loss functions such as AnglE loss, Contrastive loss, CoSENT loss, and Espresso loss. It provides backbones like BERT-based models, LLM-based models, and Bi-directional LLM-based models for training on single or multi-GPU setups. AnglE has achieved significant performance on various benchmarks and offers official pretrained models for both BERT-based and LLM-based models.

Noi
Noi is an AI-enhanced customizable browser designed to streamline digital experiences. It includes curated AI websites, allows adding any URL, offers prompts management, Noi Ask for batch messaging, various themes, Noi Cache Mode for quick link access, cookie data isolation, and more. Users can explore, extend, and empower their browsing experience with Noi.
For similar tasks

GPTQModel
GPTQModel is an easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm. It provides support for weight-only quantization and offers features such as dynamic per layer/module flexible quantization, sharding support, and auto-heal quantization errors. The toolkit aims to ensure inference compatibility with HF Transformers, vLLM, and SGLang. It offers various model supports, faster quant inference, better quality quants, and security features like hash check of model weights. GPTQModel also focuses on faster quantization, improved quant quality as measured by PPL, and backports bug fixes from AutoGPTQ.

lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron. We're releasing it with the community in the spirit of building in the open. Note that it is still very much early so don't expect 100% stability ^^' In case of problems or question, feel free to open an issue!

Firefly
Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.

Awesome-Text2SQL
Awesome Text2SQL is a curated repository containing tutorials and resources for Large Language Models, Text2SQL, Text2DSL, Text2API, Text2Vis, and more. It provides guidelines on converting natural language questions into structured SQL queries, with a focus on NL2SQL. The repository includes information on various models, datasets, evaluation metrics, fine-tuning methods, libraries, and practice projects related to Text2SQL. It serves as a comprehensive resource for individuals interested in working with Text2SQL and related technologies.

create-million-parameter-llm-from-scratch
The 'create-million-parameter-llm-from-scratch' repository provides a detailed guide on creating a Large Language Model (LLM) with 2.3 million parameters from scratch. The blog replicates the LLaMA approach, incorporating concepts like RMSNorm for pre-normalization, SwiGLU activation function, and Rotary Embeddings. The model is trained on a basic dataset to demonstrate the ease of creating a million-parameter LLM without the need for a high-end GPU.

StableToolBench
StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features such as a Virtual API System with caching and API simulators, a new set of solvable queries determined by LLMs, and a Stable Evaluation System using GPT-4. The Virtual API Server can be set up either by building from source or using a prebuilt Docker image. Users can test the server using provided scripts and evaluate models with Solvable Pass Rate and Solvable Win Rate metrics. The tool also includes model experiments results comparing different models' performance.

BetaML.jl
The Beta Machine Learning Toolkit is a package containing various algorithms and utilities for implementing machine learning workflows in multiple languages, including Julia, Python, and R. It offers a range of supervised and unsupervised models, data transformers, and assessment tools. The models are implemented entirely in Julia and are not wrappers for third-party models. Users can easily contribute new models or request implementations. The focus is on user-friendliness rather than computational efficiency, making it suitable for educational and research purposes.

AI-TOD
AI-TOD is a dataset for tiny object detection in aerial images, containing 700,621 object instances across 28,036 images. Objects in AI-TOD are smaller with a mean size of 12.8 pixels compared to other aerial image datasets. To use AI-TOD, download xView training set and AI-TOD_wo_xview, then generate the complete dataset using the provided synthesis tool. The dataset is publicly available for academic and research purposes under CC BY-NC-SA 4.0 license.
For similar jobs

weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.