
GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
Stars: 803

GPTQModel is an easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm. It provides weight-only quantization and offers features such as dynamic per-layer/module flexible quantization, sharding support, and auto-healing of quantization errors. The toolkit ensures inference compatibility with HF Transformers, vLLM, and SGLang. It offers broad model support, faster quantized inference, higher-quality quants, and security features such as hash checks of model weights. GPTQModel also focuses on faster quantization, improved quant quality as measured by PPL, and backports bug fixes from AutoGPTQ.
README:
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
- 09/24/2025 5.0.0-dev main: 🎉 Up to 90% cpu mem saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. act_group_aware is now faster and auto-enabled for GPTQ when desc_act is False for higher quality recovery.
- 09/19/2025 5.0.0-dev main: 👀 Cpu memory saving of ~73.5% during the quantization stage with new offload_to_disk quantization config property defaulting to True.
- 09/18/2025 5.0.0-dev main: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.
- 09/16/2025 4.2.5: hyb_act renamed to act_group_aware. Removed finicky torch import within setup.py. Packing bug fix and prebuilt Pytorch 2.8 whls.
- 09/12/2025 4.2.0: ✨ New models support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New fail_safe boolean toggle to .quantize() to patch-fix non-activated MoE modules due to highly uneven MoE model training. Fixed LavaQwen2 compat. Patch fix GIL=0 cuda error for multi-gpu. Fix compat with autoround + new transformers.
- 09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental mock_quantization config to skip complex computational code paths during quantization to accelerate model quant testing.
- 08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and linear N x cpu-core scaling of the packing stage. Early access Pytorch 2.8 fused-ops on Intel XPU for up to 50% speedup.
- 08/19/2025 4.0.0-dev main: Fix quantization memory usage due to some models' incorrect application of config.use_cache during inference. Fixed Transformers >= 4.54.0 compat which changed the layer forward return signature for some models.
- 08/18/2025 4.0.0-dev main: GPT-Neo model support. Memory leak fix in error capture (stacktrace) and fixed lm_head quantization compatibility for many models.
- 07/31/2025 4.0.0-dev main: New Group Aware Reordering (GAR) support and prelim Pytorch 2.8 fused-ops for Intel XPU for up to 50% speedup.
- 07/03/2025 4.0.0-dev main: New Baidu Ernie and Huawei PanGu model support.
Archived News
- 07/02/2025 4.0.0-dev main: Gemma3 4B model compat fix.
- 05/29/2025 4.0.0-dev main: Falcon H1 model support. Fixed Transformers 4.52+ compat with Qwen 2.5 VL models.
- 05/19/2025 4.0.0-dev main: Qwen 2.5 Omni model support.
- 05/05/2025 4.0.0-dev main: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and linear N x cpu-core scaling of the packing stage.
- 04/29/2025 3.1.0-dev (now 4.0) main: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for quantize(..., calibration_dataset_min_length=10) to filter out bad calibration data that exists in public datasets (wikitext).
- 04/13/2025 3.0.0: 🎉 New experimental v2 quantization option for improved model quantization accuracy validated by GSM8K_PLATINUM benchmarks vs original gptq. New Phi4-MultiModal model support. New Nvidia Nemotron-Ultra model support. New Dream model support. New experimental multi-gpu quantization support. Reduced vram usage. Faster quantization.
- 04/02/2025 2.2.0: New Qwen 2.5 VL model support. New samples log column during quantization to track module activation in MoE models. Loss log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto bfloat16 dtype loading for models based on model config. Fix kernel compile for Pytorch/ROCm. Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus.
- 03/12/2025 2.1.0: ✨ New QQQ quantization method and inference support! New Google Gemma 3 zero-day model support. New Alibaba Ovis 2 VL model support. New AMD Instella zero-day model support. New GSM8K Platinum and MMLU-Pro benchmarking support. Peft Lora training with GPTQModel is now 30%+ faster on all gpu and IPEX devices. Auto detect MoE modules not activated during quantization due to insufficient calibration data. ROCm setup.py compat fixes. Optimum and Peft compat fixes. Fixed Peft bfloat16 training.
- 03/03/2025 2.0.0: 🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion. Synced Marlin kernel inference quality fix from upstream. Added MARLIN_FP16, a lower-quality but faster backend. ModelScope support added. Logging and cli progress bar output have been revamped with sticky bottom progress. Fixed generation_config.json save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without bos. Fixed group_size=-1 and bits=3 packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate logging/progressbar to LogBar pkg. Fix ROCm version auto detection in setup install.
- 02/12/2025 1.9.0: ⚡ Offload tokenizer fixes to Toke(n)icer pkg. Optimized lm_head quant time and vram usage. Optimized DeepSeek v3/R1 model quant vram usage. Fixed Optimum compat regression in v1.8.1. 3x speed-up for Torch kernel when using Pytorch >= 2.5.0 with model.optimize(). New calibration_dataset_concat_size option to enable calibration data concat mode to mimic the original GPTQ data packing strategy, which may improve quant speed and accuracy for datasets like wikitext2.
- 02/08/2025 1.8.1: ⚡ DeepSeek v3/R1 model support. New flexible weight packing: allow quantized weights to be packed to [int32, int16, int8] dtypes. Triton and Torch kernels support the full range of the new QuantizeConfig.pack_dtype. New auto_gc: bool control in quantize() which can reduce quantization time for small models with no chance of oom. New GPTQModel.push_to_hub() api for easy quant model upload to HF repo. New buffered_fwd: bool control in model.quantize(). Over 50% quantization speed-up for visual (vl) models. Fixed bits=3 packing and group_size=-1 regression in v1.7.4.
- 01/26/2025 1.7.4: New compile() api for ~4-8% inference tps improvement. Faster pack() for post-quantization model save. Triton kernel validated for Intel/XPU when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
- 01/20/2025 1.7.3: New Telechat2 (China Telecom) and PhiMoE model support. Fixed lm_head weights duplicated in post-quantize save() for models with tied embeddings.
- 01/19/2025 1.7.2: Effective BPW (bits per weight) will now be logged during load(). Reduce loading time on Intel Arc A770/B580 XPU by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version.
- 01/17/2025 1.7.0: 👀 ✨ backend.MLX added for runtime conversion and execution of GPTQ models on Apple's MLX framework on Apple Silicon (M1+). Exports of gptq models to mlx are also now possible. We have added mlx exported models to huggingface.co/ModelCloud. ✨ lm_head quantization is now fully supported by GPTQModel without external pkg dependency.
- 01/07/2025 1.6.1: 🎉 New OpenAI api compatible end-point via model.serve(host, port). Auto-enable flash-attention2 for inference. Fixed sym=False loading regression.
- 01/06/2025 1.6.0: ⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via load() api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
- 01/01/2025 1.5.1: 🎉 2025! Added QuantizeConfig.device to clearly define which device is used for quantization: default = auto. Non-quantized models are always loaded on cpu by default and each layer is moved to QuantizeConfig.device during quantization to minimize vram usage. Compatibility fixes for attn_implementation_autoset in latest transformers.
- 12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
- 12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed dynamic loading. Reduced quantization vram usage.
- 12/15/2024 1.4.2: MacOS gpu (Metal) and cpu (M+) support added/validated for inference and quantization. Cohere 2 model support added.
- 12/13/2024 1.4.1: Added Qwen2-VL model support. mse quantization control exposed in QuantizeConfig. Monkey patch patch_vllm() and patch_hf() apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are in pending status.
- 12/10/2024 1.4.0: EvalPlus harness integration merged upstream. We now support both lm-eval and EvalPlus. Added pure torch Torch kernel. Refactored Cuda kernel into DynamicCuda kernel. Triton kernel now auto-padded for max model support. Dynamic quantization now supports both positive +: (default) and negative -: matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-Marlin kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactor and cleanup in preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of Marlin weight format since Marlin supports auto conversion of gptq format to Marlin during runtime.
- 11/29/2024 1.3.1: Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton pkg.
- 11/26/2024 1.3.0: Zero-Day Hymba model support. Removed tqdm and rogue dependency.
- 11/24/2024 1.2.3: HF GLM model support. ClearML logging integration. Use device-smi and replace gputil + psutil depends. Fixed model unit tests.
- 11/11/2024 🚀 1.2.1: Meta MobileLLM model support added. lm-eval[gptqmodel] integration merged upstream. Intel/IPEX cpu inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New .load() and .save() api.
- 10/29/2024 🚀 1.1.0: IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
- 10/12/2024 ✨ 1.0.9: Move AutoRound to optional and fix pip install regression in v1.0.8.
- 10/11/2024 ✨ 1.0.8: Add wheel for python 3.12 and cuda 11.8.
- 10/08/2024 ✨ 1.0.7: Fixed marlin (faster) kernel not being auto-selected for some models.
- 09/26/2024 ✨ 1.0.6: Fixed quantized Llama 3.2 Vision model loader.
- 09/26/2024 ✨ 1.0.5: Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
- 09/26/2024 ✨ 1.0.4: Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
- 09/18/2024 ✨ 1.0.3: Added Microsoft GRIN-MoE and MiniCPM3 support.
- 08/16/2024 ✨ 1.0.2: Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
- 08/14/2024 ✨ 1.0.0: 40% faster packing, fixed Python 3.9 compat, added lm_eval api.
- 08/10/2024 🚀 0.9.11: Added LG EXAONE 3.0 model support. New dynamic per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors due to small damp values.
- 07/31/2024 🚀 0.9.10: Ported vllm/nm gptq_marlin inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with FORMAT.GPTQ. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers depend updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
- 07/25/2024 🚀 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
- 07/13/2024 🚀 0.9.8: Run quantized models directly using GPTQModel with the fast vLLM or SGLang backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum TPS (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
- 07/08/2024 🚀 0.9.7: InternLM 2.5 model support added.
- 07/08/2024 🚀 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization with lm_head module quantization support for even more vram reduction; format export to FORMAT.GPTQ for max inference compatibility.
- 07/05/2024 🚀 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama (v1/v2)/Marlin/Triton.
- 07/03/2024 🚀 0.9.4: HF Transformers integration added and fixed Gemma 2 support bug.
- 07/02/2024 🚀 0.9.3: Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactor.
- 06/30/2024 🚀 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
- 06/29/2024 🚀 0.9.1: With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of calibration dataset resulting in >50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes and much more.
- 06/20/2024 ✨ 0.9.0: Thanks for all the work from the ModelCloud team and the open-source ML community for their contributions!
GPTQModel is a production-ready LLM compression/quantization toolkit with hardware-accelerated inference support for both cpu and gpu via HF Transformers, vLLM, and SGLang.
Public and ModelCloud's internal tests have shown that GPTQ is on par with and/or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment.
GPTQModel supports not only GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned.
GPTQModel has a modular design supporting multiple quantization methods and feature extensions.
Quantization Feature | GPTQModel | Transformers | vLLM | SGLang | Lora Training |
---|---|---|---|---|---|
GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
EoRA | ✅ | ✅ | ✅ | ✅ | x |
AWQ | ✅ | ✅* | ✅* | ✅* | ✅* |
GPTQ v2 | ✅ | ✅ | ✅ | ✅ | ✅ |
QQQ | ✅ | x | x | x | x |
Rotation | ✅ | x | x | x | x |
Group Aware Activation Reordering (GPTQ) | ✅ | ✅ | ✅ | ✅ | ✅ |
Native support for some of the most popular multi-modal models:
Multi-Modal | |
---|---|
Qwen 2.5 Omni | ✅ |
Qwen2 VL | ✅ |
Ovis 1.6 + 2 | ✅ |
Phi-4 MultiModal | ✅ |
- ✨ Native integration with HF Transformers, Optimum, and Peft (main)
- 🚀 vLLM and SGLang inference integration for quantized models with format = FORMAT.GPTQ
- ✨ Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
- 💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
- ✨ Dynamic mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether.
- 🚀 Intel Torch 2.8 fused kernel support for XPU [Arc + Datacenter Max] and CPU [avx, amx, xmx].
- 🚀 Python 3.13.3t (free-threading, GIL disabled) support for multi-gpu accelerated quantization of MoE models and multi-core cpu boost for the post-quant packing stage.
- ✨ Asymmetric sym=False support (see the config sketch after this list). Model weights sharding support with optional hash check of model weights on load.
- ✨ lm_head module quant inference support for further VRAM reduction.
- 🚀 Microsoft/BITBLAS format + dynamically compiled inference.
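For example, a minimal sketch of an asymmetric quantization config (assuming the standard QuantizeConfig arguments shown in the quantization example further below; sym defaults to True for symmetric quantization):
from gptqmodel import QuantizeConfig
# sketch: asymmetric (sym=False) 4-bit config with group size 128
quant_config = QuantizeConfig(bits=4, group_size=128, sym=False)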
🤗 ModelCloud quantized Vortex models on HF
Experimental GPTQ v2 quantization: Users have reported this mode of quantization may or may not match original GPTQ v1 implementation in terms of quality recovery.
Model | | Model | | Model | | Model | | Model | |
---|---|---|---|---|---|---|---|---|---|
Apertus | ✅ | EXAONE 3.0 | ✅ | InternLM 1/2.5 | ✅ | MobileLLM | ✅ | Qwen 2/2.5 VL | ✅ |
Baichuan | ✅ | Falcon (H1) | ✅ | Kimi K2 | ✅ | MOSS | ✅ | Qwen 2.5 Omni | ✅ |
Bloom | ✅ | FastVLM | ✅ | Klear | ✅ | MPT | ✅ | RefinedWeb | ✅ |
ChatGLM | ✅ | Gemma 1/2/3 | ✅ | Llama 1-3.3 | ✅ | Nemotron H | ✅ | StableLM | ✅ |
CodeGen | ✅ | GPTBigCode | ✅ | Llama 3.2 VL | ✅ | Nemotron Ultra | ✅ | StarCoder2 | ✅ |
Cohere 1-2 | ✅ | GPT-Neo/GPT-NeoX | ✅ | Llama 4 | ✅ | OPT | ✅ | TeleChat2 | ✅ |
DBRX Converted | ✅ | GPT-2 | ✅ | LongCatFlash | ✅ | OLMo2 | ✅ | Yi | ✅ |
Deci | ✅ | GPT-J | ✅ | LongLLaMA | ✅ | Ovis 1.6/2 | ✅ | Seed-OSS | ✅ |
DeepSeek-V2/V3/R1 | ✅ | GPT-OSS | ✅ | Instella | ✅ | Phi 1-4 | ✅ | XVERSE | ✅ |
DeepSeek-V2-Lite | ✅ | Granite | ✅ | MiniCPM3 | ✅ | PanGu-α | ✅ | ||
Dream | ✅ | GRIN-MoE | ✅ | Mistral | ✅ | Qwen 1/2/3 | ✅ | ||
ERNIE 4.5 | ✅ | Hymba | ✅ | Mixtral | ✅ | Qwen 2/3 (Next/MoE) | ✅ |
GPTQModel is validated for Linux, MacOS, and Windows 11:
Platform | Device | | Optimized Arch | Kernels |
---|---|---|---|---|
🐧 Linux | Nvidia GPU | ✅ | Ampere+ | Marlin, Exllama V2, Exllama V1, Triton, Torch |
🐧 Linux | AMD GPU | ✅ | 7900XT+, ROCm 6.2+ | Exllama V2, Exllama V1, Torch |
🐧 Linux | Intel XPU | ✅ | Arc, Datacenter Max | Torch Fused (Pytorch 2.8+), Torch |
🐧 Linux | Intel/AMD CPU | ✅ | avx, amx, xmx | Torch Fused, Torch |
🍎 MacOS | GPU (Metal) / CPU | ✅ | Apple Silicon, M1+ | Torch, MLX via conversion |
🪟 Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | Torch |
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# pip: compile and install
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas]
pip install -v . --no-build-isolation
Three-line api to use GPTQModel for gptq model inference:
from gptqmodel import GPTQModel
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
export GPTQMODEL_USE_MODELSCOPE=True
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")
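Because the endpoint is OpenAI API compatible, any OpenAI-style client can query it once model.serve(...) is running. The sketch below assumes the conventional /v1/chat/completions route and reuses the model name from the inference example above; both are illustrative and should be adjusted to your deployment:
# hypothetical client call; route and payload shape are assumptions based on OpenAI API compatibility
import requests
resp = requests.post(
    "http://0.0.0.0:12345/v1/chat/completions",
    json={
        "model": "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
        "messages": [{"role": "user", "content": "Uncovering deep insights begins with"}],
    },
)
print(resp.json())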
Basic example of using GPTQModel to quantize an llm model:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
Enable GPTQ v2 quantization by setting v2 = True for potentially higher post-quantization accuracy recovery.
# note: v2 is currently experimental and requires 2-4x more vram to execute
# if oom on 1 gpu, please set CUDA_VISIBLE_DEVICES=0,1 to 2 gpu and gptqmodel will auto use second gpu
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)
Llama 3.1 8B-Instruct quantized using test/models/test_llama3_2.py
Method | Bits/Group Size | ARC_CHALLENGE | GSM8K_Platinum_COT |
---|---|---|---|
GPTQ | 4 / 128 | 49.15 | 48.30 |
GPTQ v2 | 4 / 128 | 49.74 +1.20% | 61.46 +27.25% |
GPTQ | 3 / 128 | 39.93 | 43.26 |
GPTQ v2 | 3 / 128 | 41.13 +3.01% | 50.54 +16.83% |
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
GPTQModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.
# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as latter may overfit while increasing memory usage
eora = Lora(
# for eora generation, path is adapter save path; for load, it is loading path
path=f"{quant_path}/eora_rank32",
rank=32,
)
# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
adapter=eora,
model_id_or_path=model_id,
quantized_model_id_or_path=quant_path,
calibration_dataset=calibration_dataset,
calibration_dataset_concat_size=0,
)
# post-eora inference
model = GPTQModel.load(
model_id_or_path=quant_path,
adapter=eora
)
tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")
# For more detail of EoRA please see GPTQModel/examples/eora
# Please use the benchmark tools in later part of this README to evaluate EoRA effectiveness
For more advanced features of model quantization, please refer to this script.
Read the gptqmodel/models/llama.py code, which explains in detail via comments how model support is defined. Use it as a guide when submitting PRs for new models. Most models follow the same pattern.
GPTQModel inference is integrated into both lm-eval and evalplus
We highly recommend avoiding ppl and using lm-eval/evalplus to validate post-quantization model quality. ppl should only be used for regression tests and is not a good indicator of model output quality.
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
Below is a basic sample using the GPTQModel.eval API:
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL
model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"
# Use `lm-eval` as framework to evaluate the model
lm_eval_data = GPTQModel.eval(model_id,
framework=EVAL.LM_EVAL,
tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])
# Use `evalplus` as framework to evaluate the model
evalplus_data = GPTQModel.eval(model_id,
framework=EVAL.EVALPLUS,
tasks=[EVAL.EVALPLUS.HUMAN])
QuantizeConfig.dynamic is a dynamic control which allows specific matching modules to be skipped for quantization (negative matching) or to have a unique [bits, group_size, sym, desc_act, mse, pack_dtype] property override per matching module vs the base QuantizeConfig (positive match with override).
Sample QuantizeConfig.dynamic usage:
dynamic = {
# `.*\.` matches the layers_node prefix
# layer index start at 0
# positive match: layer 19, gate module
r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    # positive match: layer 20, gate module (prefix defaults to positive if missing)
r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
# negative match: skip layer 21, gate module
r"-:.*\.20\..*gate.*": {},
# negative match: skip all down modules for all layers
r"-:.*down.*": {},
}
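The dynamic dict is then attached to the quantization config before calling quantize(). A minimal sketch, assuming dynamic is accepted as a QuantizeConfig constructor argument (as the QuantizeConfig.dynamic property name suggests) and reusing the model and calibration setup from the quantization example above:
# sketch: per-module overrides ride along with the base QuantizeConfig
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)
model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)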
Group Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.
How to enable GAR:
Set the act_group_aware parameter to True and disable the default activation reordering by setting desc_act to False in your QuantizeConfig. For example:
quant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True)
- GPTQ v2: set v2=True in quantization config.
- GPTQ (v1): IST-DASLab, main-author: Elias Frantar, arXiv:2210.17323
- GPTQ (v2*): Yale Intelligent Computing Lab, main-author: Yuhang Li, arXiv:2504.02692. v2 naming is by Yale author and not endorsed by original GPTQ authors.
- QQQ: Meituan, main-author Ying Zhang, arXiv:2406.09904
- EoRA: Nvidia, main-author: Shih-Yang Liu, arXiv preprint arXiv:2410.21271.
- GAR: Intel, main-author: T Gafni, A Karnieli, Y Hanani, Paper
- AWQ: main-authors: Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song
# GPT-QModel
@misc{qubitium2024gptqmodel,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {GPT-QModel},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
note = {Contact: qubitium@modelcloud.ai},
year = {2024},
}
# GPTQ
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
journal={arXiv preprint arXiv:2210.17323},
year={2022}
}
# EoRA
@article{liu2024eora,
title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
journal={arXiv preprint arXiv:2410.21271},
year={2024}
}
# Group Aware Reordering (GAR)
@article{gar,
title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference, CVPRW 2025.},
author={T. Gafni, A. Karnieli, Y. Hanani},
journal={arXiv preprint arXiv:2505.14638},
year={2025}
}
# GPTQ Marlin Kernel
@article{frantar2024marlin,
title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
journal={arXiv preprint arXiv:2408.11743},
year={2024}
}
# QQQ
@article{zhang2024qqq,
title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models},
author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
journal={arXiv preprint arXiv:2406.09904},
year={2024}
}
# AWQ
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}
# GPTQ v2
@article{li2025gptqv2,
title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration},
author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
journal={arXiv preprint arXiv:2504.02692},
year={2025}
}