SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method designed to make any Large Language Model smaller while preserving accuracy.
Stars: 596
SINQ (Sinkhorn-Normalized Quantization) is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact. It offers a model-agnostic quantization technique that delivers state-of-the-art performance for Large Language Models without sacrificing accuracy. With SINQ, users can deploy models that would otherwise be too big, drastically reducing memory usage while preserving LLM quality. The tool quantizes models using dual scaling, achieving a more even error distribution that leads to stable behavior across layers and consistently higher accuracy even at very low bit-widths.
README:
⚡️ A fast, plug-and-play, model-agnostic quantization technique delivering state-of-the-art performance for Large Language Models without sacrificing accuracy.
💡 Want to run a large model on your GPU but don’t have enough memory? With SINQ, you can deploy models that would otherwise be too big, drastically reducing memory usage while preserving LLM quality.
⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min
🆕 [18/02/2026] SINQ is now integrated into HF Transformers! 🤗
You can now use SINQ in 🤗 Transformers with just a few lines of code, thanks to our SinqConfig, which is compatible with HF AutoModelForCausalLM()!
More information directly on the HF website here!
🆕 [10/02/2026] A first GGUF model with pre-SINQ! 🤗
The first GGUF model using pre-SINQ is available in our huawei-csl/PreSINQ GGUF collection!
Thanks to our new pre-SINQ algorithm (see details here), we can finally bring the strengths of SINQhorn normalization together with the advantages of GGUF quantization! Many more models coming soon!
You can vote for the next SINQ GGUF model here!
SINQ (Sinkhorn-Normalized Quantization) is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.
- 1. How does SINQ work?
- 2. Why should I use SINQ?
- 3. Quantize (and save) any LLM with SINQ
- 4. Run pre-quantized SINQ models from Hugging Face
- 5. How to reproduce paper results
- 6. Pre-SINQ: SINQhorn normalization for GGUFs (and more)!
- 7. Ongoing updates on new features and integrations
- 8. How to Cite This Work
- 9. Related Repositories
| Feature | SINQ | HQQ | A-SINQ | AWQ |
|---|---|---|---|---|
| 🎯 Calibration | Calibration-free | Calibration-free | Calibrated | Calibrated |
| 🧮 Quantization Type | Symmetric & Asymmetric | Asymmetric only | Symmetric & Asymmetric | Symmetric & Asymmetric |
| 📦 NF4 Support | Yes | No | Yes | No |
| ⚡ Quantization Speed | ~2× Faster than HQQ | Slower | ~4× Faster than AWQ | Slower |
| 📈 Model Quality | Higher | Lower | Higher | Lower |
📄 Want to know more? Read our paper on arXiv!
Click to expand a quick explanation of SINQ’s core idea
Conventional quantization uses one scale per weight dimension, which makes models vulnerable to outliers: large weights that distort scaling and cause significant errors.
SINQ solves this by introducing dual scaling: separate scale factors for rows and columns. This flexibility redistributes outlier influence and keeps quantization errors smaller and more balanced.
With standard single-scale quantization, errors tend to cluster around outliers.
With SINQ, they become spread out and less severe, preserving model accuracy even at 3-bit precision. This improvement is driven by SINQ’s Sinkhorn-normalized optimization, which iteratively rescales rows and columns to balance their variance, a process inspired by Sinkhorn matrix normalization. By reducing the overall matrix imbalance (see the paper for details), the weights become inherently easier to quantize, leading to more stable behavior across layers and consistently higher accuracy even at very low bit-widths.
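To make the intuition concrete, the snippet below is a toy illustration, not the SINQ implementation (which minimizes a matrix-imbalance objective and quantizes per group). It alternately normalizes the row and column spread of a weight matrix, applies plain round-to-nearest quantization, and compares the reconstruction error with and without dual scaling when an outlier is present:

```python
import torch

def dual_scale_normalize(W, n_iter=10):
    """Toy Sinkhorn-style balancing: alternately normalize row and column spread."""
    row_scale = torch.ones(W.shape[0], 1)
    col_scale = torch.ones(1, W.shape[1])
    for _ in range(n_iter):
        r = W.std(dim=1, keepdim=True) + 1e-8   # per-row spread
        W, row_scale = W / r, row_scale * r
        c = W.std(dim=0, keepdim=True) + 1e-8   # per-column spread
        W, col_scale = W / c, col_scale * c
    # original W == returned W * row_scale * col_scale (up to float error)
    return W, row_scale, col_scale

def quantize_rtn(W, nbits=4):
    """Plain round-to-nearest quantization with a single scale (illustration only)."""
    qmax = 2 ** (nbits - 1) - 1
    scale = W.abs().max() / qmax
    return torch.round(W / scale).clamp(-qmax - 1, qmax) * scale

W = torch.randn(256, 256)
W[0, 0] = 50.0  # inject an outlier

# Single-scale baseline: the outlier inflates the step size for the whole matrix
err_single = (quantize_rtn(W) - W).abs().mean().item()

# Dual scaling: balance the matrix, quantize, then fold the scales back in
Wn, r, c = dual_scale_normalize(W)
W_rec = quantize_rtn(Wn) * r * c
err_dual = (W_rec - W).abs().mean().item()

print(f"mean abs error, single scale: {err_single:.4f}")
print(f"mean abs error, dual scaling: {err_dual:.4f}")
```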
Click to expand a quick explanation on why you should use SINQ to quantize your LLM
- Higher LLM quality and ~2× faster quantization than HQQ
- >31× faster quantization than AWQ / GPTQ with comparable or better LLM quality
- Model-agnostic: works without knowing the specific LLM architecture, unlike QuaRot
- Training-free: it does not require end-to-end training, unlike SpinQuant or KurTail
- Additionally, A-SINQ (calibrated) further beats AWQ, GPTQ, and Hadamard+GPTQ on quality while achieving >4× faster quantization time.
Example
- ⏱️ SINQ quantizes Qwen3-14B in just ~21 sec and DeepSeekV2.5-236B in ~5 min on a single GPU
- 💾 Enables you to run DeepSeekV2.5-236B on a single GPU with ~110 GB of memory (vs ~472 GB) while losing < 1 ppl on WikiText2 and C4
There are two ways to use SINQ: directly through the Hugging Face Transformers integration, or by cloning this repository and using the full SINQ implementation.
Option 1) Directly run with HF Transformers
Since SINQ is now integrated into 🤗 Hugging Face Transformers (more info here), you can quantize models directly through the native Transformers API without cloning this repository (SINQ only; A-SINQ is not supported in the Transformers integration).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, SinqConfig

model_name = "Qwen/Qwen3-1.7B"

# Create SINQ quantization config
quant_cfg = SinqConfig(
    nbits=4,
    group_size=64,
    modules_to_not_convert=["lm_head"],
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and quantize model in one step
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_cfg,
    dtype=torch.bfloat16,
)

# Model is ready for inference
```

This uses the built-in Transformers integration and requires:

```bash
pip install sinq  # sinq.__version__ >= 0.1.7.post1
```

Option 2) Clone this repository and use the full SINQ implementation

First, clone the repository and install the dependencies:
```bash
# 1. Clone the repository
git clone https://github.com/huawei-csl/SINQ.git
cd SINQ

# 2. Install dependencies
pip install -r req.txt

# 3. Install SINQ
pip install .
```

Quantizing any 🤗 Hugging Face model with SINQ is simple and takes only a few lines of code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

model_name = "Qwen/Qwen3-1.7B"
device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq"      # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device=device
)
```

✅ That’s it. Your model is now quantized with SINQ and ready for inference or saving.
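For a quick sanity check, the quantized model can be used like any other Hugging Face causal LM. Here is a minimal generation snippet, reusing the `qmodel`, `tokenizer`, and `device` defined above (the same pattern shown later for pre-quantized models):

```python
prompt = "Explain neural network quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    out_ids = qmodel.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```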
You can further customize the quantization process to balance accuracy and memory for your needs.
Here’s a summary of the main arguments you can tune:
| Flag | Description | Options | Default |
|---|---|---|---|
| `--nbits` | Bit-width for weight quantization | 2, 3, 4, 5, 6, 8 | 4 |
| `--tiling_mode` | Weight matrix tiling strategy | 1D, 2D | 1D |
| `--group_size` | Weights per quantization group | 64, 128 | 64 |
| `--method` | Quantization method | sinq, asinq | sinq |
💡 Tip: For most cases, the defaults (--nbits 4 --tiling_mode 1D --group_size 64 --method sinq) provide an excellent trade-off between compression and accuracy.
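For example, a more aggressive configuration that trades some accuracy for extra compression could look like the sketch below. It uses the same `BaseQuantizeConfig` shown earlier; which combination works best depends on your model and backend, so treat the values as illustrative rather than recommended:

```python
from sinq.sinqlinear import BaseQuantizeConfig

# Illustrative lower-bit setup: smaller model, typically some accuracy drop
aggressive_cfg = BaseQuantizeConfig(
    nbits=3,           # 3-bit weights (supported values: 2, 3, 4, 5, 6, 8)
    group_size=128,    # larger groups mean less scale metadata per weight
    tiling_mode="1D",  # tiling strategy (1D or 2D)
    method="sinq"      # calibration-free; use "asinq" for the calibrated variant
)

# Pass it to AutoSINQHFModel.quantize_model(...) exactly as in the example above.
```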
If you want to reuse a quantized model later, save it to disk in HF-style sharded safetensors and reload without needing base FP weights.
Requires:
`pip install safetensors` and `pip install gemlite==0.5.1.post1`
```python
# --- Save to a folder (sharded safetensors) ---
from sinq.patch_model import AutoSINQHFModel
import torch

save_dir = "qwen3-1.7b-sinq-4bit"  # any path

# 'qmodel' must already be SINQ-quantized (e.g., via AutoSINQHFModel.quantize_model)
AutoSINQHFModel.save_quantized_safetensors(
    qmodel,
    tokenizer,
    save_dir,
    verbose=True,
    max_shard_size="4GB",  # typical HF shard size (use "8GB" if you prefer)
)
```

```python
# --- Reload later ---
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
import torch

tokenizer = AutoTokenizer.from_pretrained(save_dir)
device = "cuda:0"

qmodel = AutoSINQHFModel.from_quantized_safetensors(
    save_dir,
    device=device,
    compute_dtype=torch.bfloat16,
)
```

✅ Your model is now loaded and ready for inference!
You can optionally compile the model’s forward pass using torch.compile, which can provide a significant speed boost (especially after the first run):
```python
# Warm up to initialize CUDA graphs
_ = qmodel.forward(torch.tensor([[0]], device=device))

# Compile for faster inference
qmodel.forward = torch.compile(
    qmodel.forward,
    dynamic=True,
    fullgraph=False,
    backend="inductor",
    mode="reduce-overhead",
)
```

⏱️ The first run will take longer because PyTorch compiles optimized kernels, but subsequent runs will be much faster.
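To see the effect, you can time the first and second calls of the compiled forward pass. This is a rough sketch using the warm-up input from above; absolute numbers depend on your GPU and model:

```python
import time
import torch

def timed_forward():
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    _ = qmodel.forward(torch.tensor([[0]], device=device))
    torch.cuda.synchronize()
    return time.perf_counter() - t0

print(f"first call (triggers compilation): {timed_forward():.2f} s")
print(f"second call (already compiled):    {timed_forward():.2f} s")
```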
Alternative: save & reload as a single .pt file
```python
# --- Save to a folder (.pt) ---
from sinq.patch_model import AutoSINQHFModel

save_dir = "qwen3-1.7b-sinq-4bit"  # any path
AutoSINQHFModel.save_quantized(qmodel, tokenizer, save_dir, verbose=True)  # creates qmodel.pt
```

```python
# --- Reload later from .pt ---
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
import torch

device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(save_dir)
qmodel = AutoSINQHFModel.from_quantized(
    save_dir,
    device=device,
    compute_dtype=torch.bfloat16,
)
```

Compatible with the lm-eval evaluation framework
Below is a minimal example showing how to evaluate a SINQ-quantized model on a benchmark dataset:
```python
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# Wrap the already quantized model and tokenizer with HFLM
lm = HFLM(pretrained=qmodel, tokenizer=tokenizer, device=device)

# Evaluate (many tasks available in lm-eval, such as MMLU and HellaSwag)
results = evaluator.simple_evaluate(
    model=lm,
    tasks=["lambada_openai"],  # small and fast benchmark
    device=device
)
```

We’re publishing a growing collection of pre-quantized SINQ models on 🤗 Hugging Face: the huawei-csl/SINQ collection!
```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/<model_name>"  # pick a model from the collection
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda:0"

qmodel = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device=device,
    compute_dtype=torch.bfloat16,
)
```

For additional speed (on top of the speed-up already provided by gemlite), do a quick warm-up and JIT-compile the forward pass:
```python
# Warm-up to build shapes
_ = qmodel.forward(torch.tensor([[0]], device=device))

# Compile the forward pass
qmodel.forward = torch.compile(
    qmodel.forward,
    dynamic=True,
    fullgraph=False,
    backend="inductor",
    mode="reduce-overhead",
)
```

```python
prompt = "Explain neural network quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    out_ids = qmodel.generate(**inputs, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```

⏱️ The first run will be slower due to kernel/graph compilation. Subsequent runs are much faster!
Click to expand the commands to reproduce the paper results
First, install the dependencies and set up the package:
```bash
# 1. Clone the repository
git clone https://github.com/huawei-csl/SINQ.git
cd SINQ

# 2. Install dependencies
pip install -r req.txt

# 3. Install SINQ
pip install .
```

Then run the following command to quantize Qwen3-1.7B out of the box:

```bash
cd tests
python quant_model_eval.py
```

By default, this will run SINQ with the following settings:
- ✅ 4-bit weight quantization
- ✅ Dual-scale + shift parameterization
- ✅ 1D tiling
- ✅ Group size = 64
Reproduce the core SINQ results (as shown in Table 1 of the paper):
```bash
python quant_model_eval.py --model_name Qwen/Qwen3-1.7B
```

This uses INT4 uniform quantization without calibration - the main benchmark setting of the paper.
Try SINQ with non-uniform quantization (e.g., NF4):
```bash
python quant_model_eval.py --method sinq_nf4 --model_name Qwen/Qwen3-1.7B
```

Combine SINQ with activation-aware calibration (AWQ) for higher accuracy:
```bash
python quant_model_eval.py --method asinq --model_name Qwen/Qwen3-1.7B
```

Customize experiments with the following command-line arguments:
| Flag | Description | Options | Default |
|---|---|---|---|
| `--nbits` | Number of bits used to quantize model weights | 2, 3, 4, 8 | 4 |
| `--tiling_mode` | Strategy for tiling weight matrices during quantization | 1D, 2D | 1D |
| `--group_size` | Number of weights processed together as a quantization group | 64, 128 | 64 |
📝 Note: All results reported in the paper were obtained using the evaluation framework from Efficient-ML/Qwen3-Quantization rather than lm-eval.
Pre-SINQ is a model-agnostic reparameterization algorithm that applies the Sinkhorn-inspired normalization used in SINQ to make model weights easier to quantize while fully preserving the model’s function: the output of the pre-SINQ model is mathematically identical to that of the original model, and the transformation introduces no compute or memory overhead. Pre-SINQ leaves you free to choose your preferred quantizer and is fully compatible with GGUF, AWQ, GPTQ, and HQQ (a minimal sketch of the weight-absorption idea is shown after the notes below).
Example of pre-SINQ for an MLP block of an LLM. Scales are computed with our Sinkhorn-inspired algorithm and absorbed into the model weights.
- S2 scales are computed using our Sinkhorn-inspired algorithm and absorbed directly into the model.
- The transformed model can then be quantized with any existing technique.
- Pre-SINQ GGUF models are available here (and we’re continuously adding more!).
- More information and sample code to create Pre-SINQ GGUF models in ./presinq_models_generation!
Vote to select the next SINQ GGUF model here!
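To make the "absorbed into the weights" step concrete, here is a minimal, self-contained sketch of the reparameterization idea under simplifying assumptions: a single linear layer preceded by an element-wise scale (a stand-in for, e.g., a norm weight), and per-input-channel scales computed from the column spread rather than with the actual Sinkhorn-inspired procedure. See ./presinq_models_generation for the real implementation.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 64, 128
x = torch.randn(4, d_in)

gamma = torch.rand(d_in) + 0.5   # hypothetical preceding element-wise scale (e.g., a norm weight)
W = torch.randn(d_out, d_in)     # weight of the following linear layer

y_ref = (x * gamma) @ W.T        # original computation

# Per-input-channel scales. In Pre-SINQ these come from the Sinkhorn-inspired
# algorithm; here the column standard deviation is used only as a stand-in.
s = W.std(dim=0) + 1e-8

gamma_new = gamma * s            # absorb the scales into the preceding operation ...
W_new = W / s                    # ... and divide them out of the weight columns

y_new = (x * gamma_new) @ W_new.T
print(torch.allclose(y_ref, y_new, atol=1e-5))  # True: the model's function is unchanged

# W_new now has more balanced columns, so it is easier to quantize with any
# downstream method (GGUF, AWQ, GPTQ, HQQ, ...), at no extra compute or memory cost.
```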
We are actively expanding SINQ with new features and integrations. Stay tuned here for the latest updates:
- [26/09/2025] - SINQ paper released on arXiv
- [30/09/2025] - SINQ GitHub repository made public
- [02/10/2025] - SINQ paper featured on 🤗 Hugging Face Papers
- [17/10/2025] - First pre-quantized SINQ models available on 🤗Hugging Face Hub!
- [23/10/2025] - Faster inference with gemlite backend (4-bit 1D tiling)
- [10/02/2026] - First pre-SINQ GGUF model available here!
- [18/02/2026] - SINQ is now part of 🤗 Hugging Face Transformers. More info here!
- 🔜 Coming soon - Support for Conv2D layers and timm models for computer vision tasks
- 🔜 Work in progress - Support for mixed-precision quantization (combine multiple bitwidths for optimal accuracy-efficiency balance)
- 🔜 Work in progress - We’re actively working to provide support for popular frameworks such as vLLM and SGLang.
If you find SINQ useful in your research or applications, please cite our paper:
```bibtex
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}
```

This project builds upon and extends the excellent work from the following open-source projects:
- Qwen3-Quantization - Base implementation and evaluation scripts for Qwen3 quantization.
- HQQ - High-quality calibration-free quantization baseline.
📜 You can find their original licenses in the corresponding LICENSE files in these repositories.


