SpinQuant

Code repository for the paper "SpinQuant: LLM Quantization with Learned Rotations"

Stars: 76

SpinQuant is a tool for quantizing LLMs with learned rotations. It optimizes rotation matrices so that quantized models lose less accuracy relative to their full-precision counterparts. The repository provides scripts for rotation optimization and for post-training quantization (PTQ) evaluation with the optimized rotations, with arguments for the model name, batch sizes, quantization bit-widths, and rotation options. SpinQuant builds on the findings that rotation removes outliers and helps quantization, and that learning the rotation with Cayley optimization improves the results further.

README:

SpinQuant

This repository contains the code of SpinQuant introduced in our work: "SpinQuant: LLM Quantization with Learned Rotations"

In this work, we found that

Rotation is a principled way to remove outliers in LLMs and assist quantization;
Not all rotations help equally, and random rotations produce a large variance in quantized models;
Learning rotations with Cayley optimization greatly enhances the final performance.
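
To make the first finding concrete, here is a minimal NumPy sketch on synthetic data. It is not the repository's code: the helper names (quantize_rtn, random_rotation, rotation_demo), the 4-bit symmetric per-row quantizer, and the injected outlier channels are all assumptions chosen for illustration. It checks that rotating the activations and counter-rotating the weights leaves the layer output unchanged, and compares the quantization error with and without the rotation.

```python
import numpy as np


def quantize_rtn(x, bits=4):
    """Symmetric per-row round-to-nearest quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale > 0, scale, 1.0)  # avoid division by zero on all-zero rows
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def rotation_demo(seed=0, dim=256, tokens=64, outlier_scale=20.0):
    """Return (error without rotation, error with rotation) for one synthetic layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((tokens, dim))
    x[:, :4] *= outlier_scale  # synthetic outlier channels (hypothetical setup)
    w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    r = random_rotation(dim, rng)
    x_rot, w_rot = x @ r, r.T @ w
    # Rotating activations and counter-rotating weights leaves the output unchanged.
    assert np.allclose(x @ w, x_rot @ w_rot)

    reference = x @ w
    err_plain = np.linalg.norm(quantize_rtn(x) @ w - reference)
    err_rot = np.linalg.norm(quantize_rtn(x_rot) @ w_rot - reference)
    return err_plain, err_rot


for seed in range(3):
    err_plain, err_rot = rotation_demo(seed)
    print(f"seed {seed}: error without rotation {err_plain:.2f}, with rotation {err_rot:.2f}")
```

Each seed draws different data and a different random rotation, so the printed errors differ from run to run of the loop; in this toy setup the rotated error should be the smaller of the two because the rotation spreads the outlier energy across all channels.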

As a result, SpinQuant narrows the accuracy gap between W4A4KV4 quantization (4-bit weights, activations, and KV cache) and full precision to 2.9 points for the LLaMA-2 7B model on zero-shot reasoning tasks, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points.

Citation

If you find our code useful for your research, please consider citing:

@article{liu2024spinquant,
    title={SpinQuant--LLM quantization with learned rotations},
    author={Liu, Zechun and Zhao, Changsheng and Fedorov, Igor and Soran, Bilge and Choudhary, Dhruv and Krishnamoorthi, Raghuraman and Chandra, Vikas and Tian, Yuandong and Blankevoort, Tijmen},
    journal={arXiv preprint arXiv:2405.16406},
    year={2024}
}

Run

1. Requirements:

python 3.9, pytorch >= 2.0
pip install -r requirement.txt
git clone https://github.com/Dao-AILab/fast-hadamard-transform.git
cd fast-hadamard-transform
pip install .
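
For intuition about what the fast-hadamard-transform dependency computes, the following is a plain NumPy sketch of an orthonormal fast Walsh-Hadamard transform. It does not use the installed package and does not mirror its API; the function name fwht is hypothetical. Hadamard matrices are a common choice of fixed rotation in rotation-based quantization because they are orthogonal and can be applied in O(n log n) operations.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two.
    """
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        # Split into blocks of size 2h and apply the butterfly (a + b, a - b).
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(x.shape)
        h *= 2
    return x / np.sqrt(n)  # scaling makes the transform orthonormal


v = np.array([1.0, 0.0, 0.0, 0.0])
print(fwht(v))  # a single spike is spread evenly over all positions
```

Because the scaled Hadamard matrix is symmetric and orthogonal, applying fwht twice returns the input, and the Euclidean norm of the input is preserved.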

2. Steps to run:

Step 1: Optimize Rotation Matrix

For LLaMA-2 7B/13B and LLaMA-3 8B models:
bash 10_optimize_rotation.sh $model_name $w_bit $a_bit $kv_bit
e.g., bash 10_optimize_rotation.sh meta-llama/Llama-2-7b 4 4 4 for 4-bit weights, 4-bit activations, and a 4-bit KV cache on the Llama-2-7b model.
For LLaMA-2 70B and LLaMA-3 70B models:
bash 11_optimize_rotation_fsdp.sh $model_name $w_bit $a_bit $kv_bit
e.g., bash 11_optimize_rotation_fsdp.sh meta-llama/Llama-2-70b 4 4 4 for 4-bit weights, 4-bit activations, and a 4-bit KV cache on the Llama-2-70b model.
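
Step 1 learns the rotation with Cayley optimization. As a rough illustration of why a Cayley-style update keeps a matrix orthogonal, here is a toy NumPy sketch. It is not the repository's optimizer: the objective (distance to a random target rotation), the function names, the step size, and the sign convention (chosen so the step is a descent step) are all assumptions.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def cayley_step(r, grad, lr):
    """One descent step that stays on the set of orthogonal matrices.

    y is skew-symmetric, so (I + lr/2 * y)^-1 (I - lr/2 * y) is orthogonal,
    and multiplying an orthogonal r by it gives another orthogonal matrix.
    """
    y = grad @ r.T - r @ grad.T
    eye = np.eye(r.shape[0])
    return np.linalg.solve(eye + 0.5 * lr * y, (eye - 0.5 * lr * y) @ r)


def toy_loss(r, target):
    """Toy objective (hypothetical): squared Frobenius distance to a target rotation."""
    return 0.5 * np.sum((r - target) ** 2)


def optimize_rotation(n=8, steps=200, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    target = random_orthogonal(n, rng)
    r = random_orthogonal(n, rng)
    losses = [toy_loss(r, target)]
    for _ in range(steps):
        grad = r - target  # Euclidean gradient of toy_loss with respect to r
        r = cayley_step(r, grad, lr)
        losses.append(toy_loss(r, target))
    return r, losses


r, losses = optimize_rotation()
print("loss before:", losses[0], "loss after:", losses[-1])
print("max deviation from orthogonality:", np.abs(r.T @ r - np.eye(8)).max())
```

The point of the sketch is the constraint handling: an ordinary gradient step would drift away from orthogonality, whereas the Cayley update moves along the set of orthogonal matrices, so no re-orthogonalization is needed after each step.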

Step 2: Run PTQ evaluation with optimized rotation

bash 2_eval_ptq.sh $model_name $w_bit $a_bit $kv_bit

Others

When GPTQ is used in Step 2 and both weights and activations are quantized, we optimize the rotation matrices with respect to a network in which only the activations are quantized.
e.g., run bash 10_optimize_rotation.sh meta-llama/Llama-2-7b 16 4 4, followed by bash 2_eval_ptq.sh meta-llama/Llama-2-7b 4 4 4 with --optimized_rotation_path pointing to the rotation optimized for W16A4KV4.
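
The two-step workflow, including the GPTQ special case, can be summarized with a small Python helper that only assembles the command lines. The function spinquant_commands and its use_gptq and fsdp parameters are hypothetical and not part of the repository; the script names and the positional argument order are taken from the steps above. How --optimized_rotation_path is passed to Step 2 is not shown here, so the sketch leaves it out.

```python
def spinquant_commands(model_name, w_bit, a_bit, kv_bit, use_gptq=False, fsdp=False):
    """Return the Step 1 and Step 2 commands as argument lists (hypothetical helper)."""
    # The FSDP script is the one listed above for the 70B models.
    script = "11_optimize_rotation_fsdp.sh" if fsdp else "10_optimize_rotation.sh"
    # With GPTQ in Step 2, the rotation is optimized with only activations
    # quantized, i.e. with 16-bit weights in Step 1.
    opt_w_bit = 16 if use_gptq else w_bit
    step1 = ["bash", script, model_name, str(opt_w_bit), str(a_bit), str(kv_bit)]
    step2 = ["bash", "2_eval_ptq.sh", model_name, str(w_bit), str(a_bit), str(kv_bit)]
    return step1, step2


step1, step2 = spinquant_commands("meta-llama/Llama-2-7b", 4, 4, 4, use_gptq=True)
print(" ".join(step1))
print(" ".join(step2))
```

The lists could be handed to subprocess.run from the repository root, but the sketch deliberately stops at building them so that nothing is executed by accident.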

Arguments

--input_model: The model name (or path to the weights)
--output_rotation_path: The local path where the optimized rotation matrix is stored
--per_device_train_batch_size: The batch size for rotation optimization
--per_device_eval_batch_size: The batch size for PPL evaluation
--a_bits: The number of bits for activation quantization
--w_bits: The number of bits for weight quantization
--v_bits: The number of bits for value quantization
--k_bits: The number of bits for key quantization
--w_clip: Whether to use grid search to find the best weight clipping range
--w_rtn: Whether to use round-to-nearest (RTN) quantization. Without --w_rtn, GPTQ quantization is used.
--rotate: Whether to rotate the model
--optimized_rotation_path: The checkpoint path of the optimized rotation; a random rotation is used if no path is given
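
To illustrate what --w_rtn and --w_clip refer to, here is a minimal NumPy sketch of round-to-nearest weight quantization with a grid search over the clipping range. It is an assumption-laden illustration, not the repository's implementation: the symmetric per-row scheme, the clip-ratio grid, the mean-squared-error criterion, and the function names rtn_quantize and search_clip_ratio are all hypothetical.

```python
import numpy as np


def rtn_quantize(w, bits=4, clip_ratio=1.0):
    """Symmetric per-row round-to-nearest quantization; returns dequantized weights."""
    qmax = 2 ** (bits - 1) - 1
    # Shrinking the range by clip_ratio clips the largest values
    # but gives finer resolution to the rest.
    max_abs = np.abs(w).max(axis=1, keepdims=True) * clip_ratio
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale


def search_clip_ratio(w, bits=4, grid=None):
    """Pick the clip ratio on the grid with the lowest mean squared error."""
    if grid is None:
        grid = np.linspace(0.5, 1.0, 11)
    errors = [np.mean((rtn_quantize(w, bits, c) - w) ** 2) for c in grid]
    best = int(np.argmin(errors))
    return float(grid[best]), float(errors[best])


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 128))
ratio, err = search_clip_ratio(w)
baseline = np.mean((rtn_quantize(w) - w) ** 2)
print(f"best clip ratio {ratio:.2f}: MSE {err:.4f} (no clipping: {baseline:.4f})")
```

Because the grid includes a ratio of 1.0 (no clipping), the searched error can never be worse than plain RTN under this criterion.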

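Quantized Models

Zero-shot is the average zero-shot accuracy (%, higher is better); Wiki2 is the WikiText-2 perplexity (lower is better). Group labels such as W4A16KV16 give the weight, activation, and KV-cache bit-widths; -- marks results that are not reported.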

Model	LLaMA-3 8B		LLaMA-3 70B		LLaMA-2 7B		LLaMA-2 13B		LLaMA-2 70B
Method	Zero-shot	Wiki2	Zero-shot	Wiki2	Zero-shot	Wiki2	Zero-shot	Wiki2	Zero-shot	Wiki2
FloatingPoint	69.6	6.1	74.5	2.8	66.9	5.5	68.3	5.0	72.9	3.3
W4A16KV16
RTN	65.4	7.8	35.5	1e5	63.6	7.2	57.9	6.4	69.2	4.6
SmoothQuant	61.0	10.7	66.9	12.0	59.1	7.5	63.3	6.1	70.2	4.1
LLM-QAT	67.7	7.1	--	--	64.9	5.9	--	--	--	--
GPTQ	66.5	7.2	35.7	1e5	64.5	11.3	64.7	5.6	71.9	3.9
QuaRot	68.4	6.4	70.3	7.9	65.8	5.6	68.3	5.0	72.2	3.5
SpinQuant	68.5	6.4	71.6	4.8	65.9	5.6	68.5	5.0	72.6	3.5
W4A4KV16
RTN	38.5	9e2	35.6	1e5	35.6	2e3	35.3	7e3	35.1	2e5
SmoothQuant	40.3	8e2	55.3	18.0	41.8	2e2	44.9	34.5	64.6	57.1
LLM-QAT	44.9	42.9	--	--	47.8	12.9	--	--	--	--
GPTQ	37.0	9e2	35.3	1e5	36.8	8e3	35.3	5e3	35.5	2e6
QuaRot	63.8	7.9	65.4	20.4	63.5	6.1	66.7	5.4	70.4	3.9
SpinQuant	65.8	7.1	69.5	5.5	64.1	5.9	67.2	5.2	71.0	3.8
W4A4KV4
RTN	38.2	1e3	35.2	1e5	37.1	2e3	35.4	7e3	35.0	2e5
SmoothQuant	38.7	1e3	52.4	22.1	39.0	6e2	40.5	56.6	55.9	10.5
LLM-QAT	43.2	52.5	--	--	44.9	14.9	--	--	--	--
GPTQ	37.1	1e3	35.1	1e5	36.8	9e3	35.2	5e3	35.6	1e6
QuaRot	63.3	8.0	65.1	20.2	62.5	6.4	66.2	5.4	70.3	3.9
SpinQuant	65.2	7.3	69.3	5.5	64.0	5.9	66.9	5.3	71.2	3.8
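
The headline numbers quoted earlier can be re-derived from the table. The short Python sketch below copies the zero-shot accuracies of the FloatingPoint row and the W4A4KV4 SpinQuant row, plus the LLaMA-2 7B entries for LLM-QAT and SmoothQuant, and computes the differences; the variable and function names are hypothetical.

```python
# Zero-shot average accuracy (%) copied from the table above.
full_precision = {
    "LLaMA-3 8B": 69.6, "LLaMA-3 70B": 74.5,
    "LLaMA-2 7B": 66.9, "LLaMA-2 13B": 68.3, "LLaMA-2 70B": 72.9,
}
spinquant_w4a4kv4 = {
    "LLaMA-3 8B": 65.2, "LLaMA-3 70B": 69.3,
    "LLaMA-2 7B": 64.0, "LLaMA-2 13B": 66.9, "LLaMA-2 70B": 71.2,
}
# W4A4KV4 baselines for LLaMA-2 7B only.
llm_qat_7b, smoothquant_7b = 44.9, 39.0


def accuracy_gaps(reference, quantized):
    """Per-model accuracy drop of the quantized model, rounded to one decimal."""
    return {model: round(reference[model] - quantized[model], 1) for model in reference}


gaps = accuracy_gaps(full_precision, spinquant_w4a4kv4)
for model, gap in gaps.items():
    print(f"{model}: {gap} points below full precision")

# LLaMA-2 7B: 2.9-point gap, 19.1 points above LLM-QAT, 25.0 above SmoothQuant.
print(round(spinquant_w4a4kv4["LLaMA-2 7B"] - llm_qat_7b, 1))
print(round(spinquant_w4a4kv4["LLaMA-2 7B"] - smoothquant_7b, 1))
```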

You can download the optimized rotation matrices here.

Acknowledgement

The results reported in the paper were obtained with the internal LLaMA codebase at Meta. We reproduced our experiments with the HuggingFace codebase and released the code here; it is partially based on HuggingFace transformers, QuaRot, QuIP#, and Optimization-on-Stiefel-Manifold-via-Cayley-Transform.

Contact

Zechun Liu, Reality Labs, Meta Inc (zechunliu at meta dot com)

Changsheng Zhao, Reality Labs, Meta Inc (cszhao at meta dot com)

Relevant Projects

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases [Paper] [Code]

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models [Paper] [Code]

License

SpinQuant is CC-BY-NC 4.0 licensed as of now.
