GPTQModel
Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
Stars: 203
GPTQModel is an easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm. It supports weight-only quantization and offers features such as dynamic per-layer/module quantization control, sharding support, and auto-healing of quantization errors. The toolkit aims to ensure inference compatibility with HF Transformers, vLLM, and SGLang. It offers broad model support, faster quant inference, better quality quants, and security features such as hash checking of model weights. GPTQModel also focuses on faster quantization, improved quant quality as measured by PPL, and backports bug fixes from AutoGPTQ.
README:
Production ready LLM model compression/quantization toolkit with accelerated inference support for both cpu/gpu via HF, vLLM, and SGLang.
- 01/17/2025 1.7.0: `backend.MLX` added for runtime conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` are also now possible. We have added `mlx`-exported models to huggingface.co/ModelCloud. `lm_head` quantization is now fully supported by GPTQModel without an external package dependency.
- 01/07/2025 1.6.1: New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference. Fixed `sym=False` loading regression.
- 01/06/2025 1.6.0: 25% faster quantization. 35% reduction in vram usage vs v1.5. AMD ROCm (6.2+) support added and validated for 7900XT+ GPUs. Auto-tokenizer loader via the `load()` api. For most models you no longer need to manually init a tokenizer for either inference or quantization.
- 01/01/2025 1.5.1: 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on cpu by default and each layer is moved to `QuantizeConfig.device` during quantization to minimize vram usage (see the sketch after this list). Compatibility fixes for `attn_implementation_autoset` in latest transformers.
- 12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
- 12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
- 12/15/2024 1.4.2: MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
- 12/13/2024 1.4.1: Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` apis added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
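The `QuantizeConfig.device` option mentioned in 1.5.1 can be set explicitly when building a quantization config. A minimal sketch, assuming `device` accepts a torch-style device string (the `"cuda:0"` value and model id are illustrative; the default remains `auto`):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Pin quantization work to a specific GPU; the non-quantized model is still
# loaded on cpu and each layer is moved to this device during quantization.
quant_config = QuantizeConfig(bits=4, group_size=128, device="cuda:0")
model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
```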
Archived News
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be the `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated the saving of the `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime.
* 11/29/2024 1.3.1 Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on the triton pkg.
* 11/26/2024 1.3.0 Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 1.2.3 HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` dependencies. Fixed model unit tests.
* 11/11/2024 1.2.1 Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged, replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.
* 10/29/2024 1.1.0 IBM Granite model support. Full auto-buildless wheel install from pypi. Reduced max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
* 10/12/2024 1.0.9 Moved AutoRound to optional and fixed pip install regression in v1.0.8.
* 10/11/2024 1.0.8 Added wheel for python 3.12 and cuda 11.8.
* 10/08/2024 1.0.7 Fixed marlin (faster) kernel not being auto-selected for some models.
* 09/26/2024 1.0.6 Fixed quantized Llama 3.2 vision model loader.
* 09/26/2024 1.0.5 Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
* 09/26/2024 1.0.4 Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
* 09/18/2024 1.0.3 Added Microsoft GRIN-MoE and MiniCPM3 support.
* 08/16/2024 1.0.2 Support for Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
* 08/14/2024 1.0.0 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api.
* 08/10/2024 0.9.11 Added LG EXAONE 3.0 model support. New `dynamic` per-layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
* 07/31/2024 0.9.10 Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64, 32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto-calculate auto-round nsamples/seglen parameters based on the calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
* 07/25/2024 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
* 07/13/2024 0.9.8: Run quantized models directly using GPTQModel with the fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
* 07/08/2024 0.9.7: InternLM 2.5 model support added.
* 07/08/2024 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction; format export to `FORMAT.GPTQ` for max inference compatibility.
* 07/05/2024 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama (v1/v2)/Marlin/Triton.
* 07/03/2024 0.9.4: HF Transformers integration added and Gemma 2 support bug fixed.
* 07/02/2024 0.9.3: Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactoring.
* 06/30/2024 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
* 06/29/2024 0.9.1: 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), new BITBLAS format/kernel, proper batching of the calibration dataset resulting in >50% quantization speedup, security hash check of loaded model weights, tons of refactoring/usability improvements, bug fixes and much more.
* 06/20/2024 0.9.0: Thanks for all the work from the ModelCloud team and the open-source ML community for their contributions!
GPTQModel started out as a major refactor (fork) of AutoGPTQ but has now morphed into a full stand-in replacement with a cleaner api, up-to-date model support, faster inference, higher quality quants, and a pledge that ModelCloud, together with the open-source ML community, will make every effort to keep the library up to date with the latest advancements and model support.
Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on par with or exceeds other 4-bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and requests per second (RPS). GPTQ has the optimal blend of quality and inference speed needed in a real-world production deployment.
- Extensive model support for: `Ovis VL`, `Llama 1-3.3`, `Qwen2-VL`, `Olmo2`, `Hymba`, `GLM`, `IBM Granite`, `Llama 3.2 Vision`, `MiniCPM3`, `GRIN-Moe`, `Phi 1-4`, `EXAONE 3.0`, `InternLM 2.5`, `Gemma 2`, `DeepSeek-V2`, `DeepSeek-V2-Lite`, `ChatGLM`, `MiniCPM`, `Qwen2MoE`, `DBRX`.
- Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
- 100% CI unit-test coverage for all supported models and kernels, including post-quantization quality regression.
- `Dynamic` mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether (see the sketch after this list).
- vLLM and SGLang inference integration for quantized models where format = `FORMAT.GPTQ`.
- Intel/IPEX hardware accelerated quantization/inference for CPU [`avx`, `amx`, `xmx`] and Intel GPU [`Arc` + `Datacenter Max`].
- Microsoft/BITBLAS format + dynamically compiled inference.
- Intel/AutoRound alternative gptq-inference compatible quantization method.
- Asymmetric `Sym=False` support.
- `lm_head` module quant inference support for further VRAM reduction (auto-round only).
- Model weights sharding support with optional hash check of model weights on load.
- 40% faster `packing` stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).
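A minimal sketch of `Dynamic` per-module control, assuming the regex-keyed `dynamic` dict accepted by `QuantizeConfig` (the module patterns and override values below are illustrative, not the library's documented defaults):

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Illustrative per-module overrides: keys are regex patterns matched against
# module names. A "+:" prefix (the default) applies the override config to
# matched modules; a "-:" prefix excludes matched modules from quantization.
dynamic = {
    r"+:.*\.mlp\..*": {"bits": 8, "group_size": 64},  # quantize MLP projections at 8-bit
    r"-:.*\.lm_head.*": {},                           # skip lm_head entirely
}

quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)
model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
```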
ModelCloud quantized Vortex models on HF:
| Model | ✅ | Model | ✅ | Model | ✅ | Model | ✅ | Model | ✅ |
|---|---|---|---|---|---|---|---|---|---|
| Baichuan | ✅ | Falcon | ✅ | Llama 1-3.3 | ✅ | OLMo2 | ✅ | Yi | ✅ |
| Bloom | ✅ | Gemma 2 | ✅ | Llama 3.2 VL | ✅ | Ovis 1.6 | ✅ | | |
| ChatGLM | ✅ | GPTBigCode | ✅ | LongLLaMA | ✅ | Phi 1-4 | ✅ | | |
| CodeGen | ✅ | GPTNeoX | ✅ | MiniCPM3 | ✅ | Qwen | ✅ | | |
| Cohere 1-2 | ✅ | GPT-2 | ✅ | Mistral | ✅ | Qwen2 MoE | ✅ | | |
| DBRX Converted | ✅ | GPT-J | ✅ | Mixtral | ✅ | Qwen2 VL | ✅ | | |
| Deci | ✅ | Granite | ✅ | MobileLLM | ✅ | RefinedWeb | ✅ | | |
| DeepSeek-V2 | ✅ | GRIN-MoE | ✅ | MOSS | ✅ | StableLM | ✅ | | |
| DeepSeek-V2-Lite | ✅ | Hymba | ✅ | MPT | ✅ | StarCoder2 | ✅ | | |
| EXAONE 3.0 | ✅ | InternLM 1/2.5 | ✅ | OPT | ✅ | XVERSE | ✅ | | |
GPTQModel is validated for Linux, MacOS, and Windows 11:
| Platform | Device | Supported | Optimized Arch | Kernels |
|---|---|---|---|---|
| Linux | Nvidia GPU | ✅ | Ampere+ | Marlin, Exllama V2, Exllama V1, Triton, DynamicCuda, Torch |
| Linux | Intel XPU | ✅ | `Arc`, `Datacenter Max` | IPEX, Torch |
| Linux | AMD GPU | ✅ | `7900XT+`, `ROCm 6.2+` | Exllama V2, Exllama V1, DynamicCuda, Torch |
| Linux | Intel/AMD CPU | ✅ | `avx`, `amx`, `xmx` | IPEX, Torch |
| MacOS | GPU (Metal) / CPU | ✅ | `Apple Silicon`, `M1+` | Torch |
| Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | DynamicCuda, Torch |
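Kernel selection is automatic, but a specific backend can also be requested at load time. A minimal sketch, assuming the `BACKEND` enum exposed by the package and the `backend` argument of `GPTQModel.load` (the Marlin choice and model id are illustrative):

```python
from gptqmodel import BACKEND, GPTQModel

# Request the Marlin kernel explicitly instead of relying on auto-selection;
# per the notes above, GPTQModel can fall back to a compatible kernel when a
# requested kernel/module pair is unsupported.
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    backend=BACKEND.MARLIN,
)
result = model.generate("Uncovering deep insights begins with")[0]
```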
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas,ipex,auto_round]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# pip: compile and install
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation
Two-line api to use `GPTQModel` for gptq model inference:
from gptqmodel import GPTQModel
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0]
# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")
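Once `model.serve(host, port)` is running, any OpenAI-compatible client should be able to talk to it. A minimal sketch using the `openai` Python client, assuming a standard OpenAI-compatible route layout (the `/v1` base path, model name, and placeholder api key are illustrative assumptions, not documented values):

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the locally served model.
client = OpenAI(base_url="http://0.0.0.0:12345/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    messages=[{"role": "user", "content": "Uncovering deep insights begins with"}],
)
print(response.choices[0].message.content)
```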
Basic example of using `GPTQModel` to quantize an llm model:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0]
For more advanced features of model quantization, please refer to this script.
Read the `gptqmodel/models/llama.py` code, which explains in detail via comments how model support is defined. Use it as a guide when submitting PRs for new models; most models follow the same pattern (a simplified sketch follows below).
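A simplified sketch of that per-model definition pattern, loosely mirroring the structure of `gptqmodel/models/llama.py`; the class and attribute names shown here may differ between GPTQModel versions, so treat this as illustrative rather than a drop-in definition:

```python
from gptqmodel.models.base import BaseGPTQModel

class MyNewModelGPTQ(BaseGPTQModel):
    # Modules that are never quantized (embeddings, final norm).
    base_modules = ["model.embed_tokens", "model.norm"]

    # Where the repeating decoder layers live, and their class name.
    layers_node = "model.layers"
    layer_type = "MyNewModelDecoderLayer"

    # Quantizable sub-modules inside each decoder layer, grouped in the
    # order they are processed during quantization.
    layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
        ["mlp.up_proj", "mlp.gate_proj"],
        ["mlp.down_proj"],
    ]
```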
GPTQModel inference is integrated into both lm-eval and evalplus.
We highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests; it is not a good indicator of model output quality.
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install "lm-eval>=0.4.7"
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
Below is a basic sample using the `GPTQModel.eval` API:
from gptqmodel import GPTQModel
from gptqmodel.utils import EVAL
model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"
# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE], output_file='lm-eval_result.json')
# Use `evalplus` as framework to evaluate the model
evalplus_results = GPTQModel.eval(model_id, framework=EVAL.EVALPLUS, tasks=[EVAL.EVALPLUS.HUMAN], output_file='evalplus_result.json')
@misc{gptqmodel,
author = {ModelCloud.ai and [email protected]},
title = {GPTQModel},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
note = {Contact: [email protected]}
}
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
year={2022},
journal={arXiv preprint arXiv:2210.17323}
}
@article{frantar2024marlin,
title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
journal={arXiv preprint arXiv:2408.11743},
year={2024}
}