![kvpress](/statics/github-mark.png)
kvpress
LLM KV cache compression made easy
Stars: 311
![screenshot](/screenshots_githubs/NVIDIA-kvpress.jpg)
This repository implements multiple key-value cache pruning methods and benchmarks using transformers, aiming to simplify the development of new methods for researchers and developers in the field of long-context language models. It provides a set of 'presses' that compress the cache during the pre-filling phase, with each press having a compression ratio attribute. The repository includes various training-free presses, special presses, and supports KV cache quantization. Users can contribute new presses and evaluate the performance of different presses on long-context datasets.
README:
Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
pip install kvpress
If possible, install flash attention:
pip install flash-attn --no-build-isolation
kvpress provides a set of "presses" that compress the KV cache during the prefilling-phase. Each press is associated with a compression_ratio
attribute that measures the compression of the cache. The easiest way to use a press is through our custom KVPressTextGenerationPipeline
. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported and handles chat templates and tokenization for you:
from transformers import pipeline
from kvpress import ExpectedAttentionPress
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)
context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context" # optional
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
In the snippet above, the compression is only applied on the context tokens so that you can evaluate the compression for different questions. Check the Wikipedia notebook demo for a more detailed example (also available on Colab here).
[!IMPORTANT]
We focus on compression during the pre-filling phase as the KV cache becomes a bottleneck for long-context sequence (100k - 1M tokens) which are essentially long context prompts. This would typically apply to improving prompt caching systems.
[!NOTE]
Usemodel_kwargs={"attn_implementation":"flash_attention_2"}
to enable flash attention. To use the pressObservedAttentionPress
, you need to specifymodel_kwargs={"attn_implementation":"eager"}
as this press requires to materialize the attention weights
We welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the new_press.ipynb notebook for a step-by-step guide.
All current presses are training free and inherit from BasePress
(source).
Several presses inherit from ScorerPress
(source) and rely on a score to prune the KV pairs with lowest importance:
-
RandomPress
(source): random score -
KnormPress
(source, paper): inverse norm of the key -
SnapKVPress
(source, paper): average attention weight of the last queries -
ExpectedAttentionPress
(source, notebook): expected attention weight during the generation phase -
StreamingLLMPress
(source, paper): keep only the initial and recent tokens -
TOVAPress
(source, paper): attention weight of the last query averaged across heads -
ObservedAttentionPress
(source, paper): average attention weight observed during in pre-filling phase
Some presses rely on a different logic:
-
ThinKPress
(source, paper): compress the dimensions of the keys based on the channel attention score on the last queries -
SimLayerKVPress
(source, paper): identify "lazy" layers, and apply the StreamingLLM approach to them
Finally we provide wrapper presses that can be combined with other presses:
-
AdaKVPress
(source, paper): prune bottom scores of anyScorerPress
but across all heads, achieving head-wise compressions -
PerLayerCompressionPress
(source): compress each layer with a different compression ratio (experimental) -
ComposedPress
(source): compose multiple presses together by chaining their forward hooks -
KeyRerotationPress
(source): rerotate pruned keys to have continuous RoPE embeddings -
ChunkPress
(source, paper): compress the KV cache on each sequence chunk separately. This can yield to more uniform compression across long sequences
For a detailed list of existing KV cache compression methods, check Awesome-KV-Cache-Compression or Awesome-LLM-Compression
The speed_and_memory.ipynb notebook can help you to measure peak memory usage and total time gain.
We provide a simple CLI to evaluate the performance of the different presses on several long-context datasets. Below we report the average performance on the RULER dataset with 4k context length for different presses.
Please refer to the evaluation directory for more details and results.
We support KV cache quantization through the transformers QuantizedCache
class (see HF blog post). To use it, simply pass a cache object to your pipeline:
from transformers import QuantizedCacheConfig, QuantoQuantizedCache
config = QuantizedCacheConfig(nbits=4)
cache = QuantoQuantizedCache(config)
pipe(..., cache=cache)
By default, the DynamicCache
is used (no quantization).
[!IMPORTANT]
To use theQuantizedCache
, you need to install additional dependencies (e.g.pip install optimum-quanto
).
Some presses depend on the model architecture (e.g. ExpectedAttentionPress
or SnapKVPress
) hence they might not work with all models. We tested support for LlamaForCausalLM
, MistralForCausalLM
, Phi3ForCausalLM
and Qwen2ForCausalLM
but many other models might be supported out of the box because their implementation is often similar in transformers.
Memory usage should be reduced by around compression_ratio * kv_cache_size
. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using this notebook.
A press registers a forward hook (press.forward_hook
method) to each attention layer during the pre-filling phase. Registration can be applied using the press as a context manager (press.__call__
method):
import torch
from transformers import AutoModelForCausalLM
from kvpress import KnormPress
device = "cuda:0"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(ckpt).to(device)
press = KnormPress(compression_ratio=0.4)
inputs = model.dummy_inputs["input_ids"].to(device)
with torch.no_grad():
print(model(inputs).past_key_values[0][0].shape)
# torch.Size([3, 8, 5, 128])
with torch.no_grad(), press(model):
print(model(inputs).past_key_values[0][0].shape)
# torch.Size([3, 8, 3, 128])
In fact you can use model.generate
with a press by using the press as a context manager:
with press(model):
outputs = model.generate(inputs)
However, the generate
method does not allow to exclude the question from the compression, which would artificially favors methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (e.g. for use cases such as chat or document question answering). Finally the generate
method does not allow to provide generation for multiple questions at once.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for kvpress
Similar Open Source Tools
![kvpress Screenshot](/screenshots_githubs/NVIDIA-kvpress.jpg)
kvpress
This repository implements multiple key-value cache pruning methods and benchmarks using transformers, aiming to simplify the development of new methods for researchers and developers in the field of long-context language models. It provides a set of 'presses' that compress the cache during the pre-filling phase, with each press having a compression ratio attribute. The repository includes various training-free presses, special presses, and supports KV cache quantization. Users can contribute new presses and evaluate the performance of different presses on long-context datasets.
![paper-qa Screenshot](/screenshots_githubs/Future-House-paper-qa.jpg)
paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and includes a process of embedding docs, queries, searching for top passages, creating summaries, using an LLM to re-score and select relevant summaries, putting summaries into prompt, and generating answers. The tool can be used to answer specific questions related to scientific research by leveraging citations and relevant passages from documents.
![paper-qa Screenshot](/screenshots_githubs/whitead-paper-qa.jpg)
paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and follows a process of embedding docs and queries, searching for top passages, creating summaries, scoring and selecting relevant summaries, putting summaries into prompt, and generating answers. Users can customize prompts and use various models for embeddings and LLMs. The tool can be used asynchronously and supports adding documents from paths, files, or URLs.
![LeanCopilot Screenshot](/screenshots_githubs/lean-dojo-LeanCopilot.jpg)
LeanCopilot
Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.
![curator Screenshot](/screenshots_githubs/bespokelabsai-curator.jpg)
curator
Bespoke Curator is an open-source tool for data curation and structured data extraction. It provides a Python library for generating synthetic data at scale, with features like programmability, performance optimization, caching, and integration with HuggingFace Datasets. The tool includes a Curator Viewer for dataset visualization and offers a rich set of functionalities for creating and refining data generation strategies.
![web-llm Screenshot](/screenshots_githubs/mlc-ai-web-llm.jpg)
web-llm
WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
![LayerSkip Screenshot](/screenshots_githubs/facebookresearch-LayerSkip.jpg)
LayerSkip
LayerSkip is an implementation enabling early exit inference and self-speculative decoding. It provides a code base for running models trained using the LayerSkip recipe, offering speedup through self-speculative decoding. The tool integrates with Hugging Face transformers and provides checkpoints for various LLMs. Users can generate tokens, benchmark on datasets, evaluate tasks, and sweep over hyperparameters to optimize inference speed. The tool also includes correctness verification scripts and Docker setup instructions. Additionally, other implementations like gpt-fast and Native HuggingFace are available. Training implementation is a work-in-progress, and contributions are welcome under the CC BY-NC license.
![raft Screenshot](/screenshots_githubs/rapidsai-raft.jpg)
raft
RAFT (Reusable Accelerated Functions and Tools) is a C++ header-only template library with an optional shared library that contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
![x-lstm Screenshot](/screenshots_githubs/myscience-x-lstm.jpg)
x-lstm
This repository contains an unofficial implementation of the xLSTM model introduced in Beck et al. (2024). It serves as a didactic tool to explain the details of a modern Long-Short Term Memory model with competitive performance against Transformers or State-Space models. The repository also includes a Lightning-based implementation of a basic LLM for multi-GPU training. It provides modules for scalar-LSTM and matrix-LSTM, as well as an xLSTM LLM built using Pytorch Lightning for easy training on multi-GPUs.
![probsem Screenshot](/screenshots_githubs/benlipkin-probsem.jpg)
probsem
ProbSem is a repository that provides a framework to leverage large language models (LLMs) for assigning context-conditional probability distributions over queried strings. It supports OpenAI engines and HuggingFace CausalLM models, and is flexible for research applications in linguistics, cognitive science, program synthesis, and NLP. Users can define prompts, contexts, and queries to derive probability distributions over possible completions, enabling tasks like cloze completion, multiple-choice QA, semantic parsing, and code completion. The repository offers CLI and API interfaces for evaluation, with options to customize models, normalize scores, and adjust temperature for probability distributions.
![ScandEval Screenshot](/screenshots_githubs/ScandEval-ScandEval.jpg)
ScandEval
ScandEval is a framework for evaluating pretrained language models on mono- or multilingual language tasks. It provides a unified interface for benchmarking models on a variety of tasks, including sentiment analysis, question answering, and machine translation. ScandEval is designed to be easy to use and extensible, making it a valuable tool for researchers and practitioners alike.
![LLMUnity Screenshot](/screenshots_githubs/undreamai-LLMUnity.jpg)
LLMUnity
LLM for Unity enables seamless integration of Large Language Models (LLMs) within the Unity engine, allowing users to create intelligent characters for immersive player interactions. The tool supports major LLM models, runs locally without internet access, offers fast inference on CPU and GPU, and is easy to set up with a single line of code. It is free for both personal and commercial use, tested on Unity 2021 LTS, 2022 LTS, and 2023. Users can build multiple AI characters efficiently, use remote servers for processing, and customize model settings for text generation.
![storm Screenshot](/screenshots_githubs/stanford-oval-storm.jpg)
storm
STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search. While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage. **Try out our [live research preview](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**
![ell Screenshot](/screenshots_githubs/MadcowD-ell.jpg)
ell
ell is a lightweight, functional prompt engineering framework that treats prompts as programs rather than strings. It provides tools for prompt versioning, monitoring, and visualization, as well as support for multimodal inputs and outputs. The framework aims to simplify the process of prompt engineering for language models.
![HuggingFaceGuidedTourForMac Screenshot](/screenshots_githubs/domschl-HuggingFaceGuidedTourForMac.jpg)
HuggingFaceGuidedTourForMac
HuggingFaceGuidedTourForMac is a guided tour on how to install optimized pytorch and optionally Apple's new MLX, JAX, and TensorFlow on Apple Silicon Macs. The repository provides steps to install homebrew, pytorch with MPS support, MLX, JAX, TensorFlow, and Jupyter lab. It also includes instructions on running large language models using HuggingFace transformers. The repository aims to help users set up their Macs for deep learning experiments with optimized performance.
![mimir Screenshot](/screenshots_githubs/iamgroot42-mimir.jpg)
mimir
MIMIR is a Python package designed for measuring memorization in Large Language Models (LLMs). It provides functionalities for conducting experiments related to membership inference attacks on LLMs. The package includes implementations of various attacks such as Likelihood, Reference-based, Zlib Entropy, Neighborhood, Min-K% Prob, Min-K%++, Gradient Norm, and allows users to extend it by adding their own datasets and attacks.
For similar tasks
![kvpress Screenshot](/screenshots_githubs/NVIDIA-kvpress.jpg)
kvpress
This repository implements multiple key-value cache pruning methods and benchmarks using transformers, aiming to simplify the development of new methods for researchers and developers in the field of long-context language models. It provides a set of 'presses' that compress the cache during the pre-filling phase, with each press having a compression ratio attribute. The repository includes various training-free presses, special presses, and supports KV cache quantization. Users can contribute new presses and evaluate the performance of different presses on long-context datasets.
![Co-LLM-Agents Screenshot](/screenshots_githubs/UMass-Foundation-Model-Co-LLM-Agents.jpg)
Co-LLM-Agents
This repository contains code for building cooperative embodied agents modularly with large language models. The agents are trained to perform tasks in two different environments: ThreeDWorld Multi-Agent Transport (TDW-MAT) and Communicative Watch-And-Help (C-WAH). TDW-MAT is a multi-agent environment where agents must transport objects to a goal position using containers. C-WAH is an extension of the Watch-And-Help challenge, which enables agents to send messages to each other. The code in this repository can be used to train agents to perform tasks in both of these environments.
![GPT4Point Screenshot](/screenshots_githubs/Pointcept-GPT4Point.jpg)
GPT4Point
GPT4Point is a unified framework for point-language understanding and generation. It aligns 3D point clouds with language, providing a comprehensive solution for tasks such as 3D captioning and controlled 3D generation. The project includes an automated point-language dataset annotation engine, a novel object-level point cloud benchmark, and a 3D multi-modality model. Users can train and evaluate models using the provided code and datasets, with a focus on improving models' understanding capabilities and facilitating the generation of 3D objects.
![asreview Screenshot](/screenshots_githubs/asreview-asreview.jpg)
asreview
The ASReview project implements active learning for systematic reviews, utilizing AI-aided pipelines to assist in finding relevant texts for search tasks. It accelerates the screening of textual data with minimal human input, saving time and increasing output quality. The software offers three modes: Oracle for interactive screening, Exploration for teaching purposes, and Simulation for evaluating active learning models. ASReview LAB is designed to support decision-making in any discipline or industry by improving efficiency and transparency in screening large amounts of textual data.
![Groma Screenshot](/screenshots_githubs/FoundationVision-Groma.jpg)
Groma
Groma is a grounded multimodal assistant that excels in region understanding and visual grounding. It can process user-defined region inputs and generate contextually grounded long-form responses. The tool presents a unique paradigm for multimodal large language models, focusing on visual tokenization for localization. Groma achieves state-of-the-art performance in referring expression comprehension benchmarks. The tool provides pretrained model weights and instructions for data preparation, training, inference, and evaluation. Users can customize training by starting from intermediate checkpoints. Groma is designed to handle tasks related to detection pretraining, alignment pretraining, instruction finetuning, instruction following, and more.
![amber-train Screenshot](/screenshots_githubs/LLM360-amber-train.jpg)
amber-train
Amber is the first model in the LLM360 family, an initiative for comprehensive and fully open-sourced LLMs. It is a 7B English language model with the LLaMA architecture. The model type is a language model with the same architecture as LLaMA-7B. It is licensed under Apache 2.0. The resources available include training code, data preparation, metrics, and fully processed Amber pretraining data. The model has been trained on various datasets like Arxiv, Book, C4, Refined-Web, StarCoder, StackExchange, and Wikipedia. The hyperparameters include a total of 6.7B parameters, hidden size of 4096, intermediate size of 11008, 32 attention heads, 32 hidden layers, RMSNorm ε of 1e^-6, max sequence length of 2048, and a vocabulary size of 32000.
![kan-gpt Screenshot](/screenshots_githubs/AdityaNG-kan-gpt.jpg)
kan-gpt
The KAN-GPT repository is a PyTorch implementation of Generative Pre-trained Transformers (GPTs) using Kolmogorov-Arnold Networks (KANs) for language modeling. It provides a model for generating text based on prompts, with a focus on improving performance compared to traditional MLP-GPT models. The repository includes scripts for training the model, downloading datasets, and evaluating model performance. Development tasks include integrating with other libraries, testing, and documentation.
![LLM-SFT Screenshot](/screenshots_githubs/yongzhuo-LLM-SFT.jpg)
LLM-SFT
LLM-SFT is a Chinese large model fine-tuning tool that supports models such as ChatGLM, LlaMA, Bloom, Baichuan-7B, and frameworks like LoRA, QLoRA, DeepSpeed, UI, and TensorboardX. It facilitates tasks like fine-tuning, inference, evaluation, and API integration. The tool provides pre-trained weights for various models and datasets for Chinese language processing. It requires specific versions of libraries like transformers and torch for different functionalities.
For similar jobs
![weave Screenshot](/screenshots_githubs/wandb-weave.jpg)
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
![LLMStack Screenshot](/screenshots_githubs/trypromptly-LLMStack.jpg)
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
![VisionCraft Screenshot](/screenshots_githubs/VisionCraft-org-VisionCraft.jpg)
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
![kaito Screenshot](/screenshots_githubs/Azure-kaito.jpg)
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
![PyRIT Screenshot](/screenshots_githubs/Azure-PyRIT.jpg)
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
![tabby Screenshot](/screenshots_githubs/TabbyML-tabby.jpg)
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
![spear Screenshot](/screenshots_githubs/isl-org-spear.jpg)
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
![Magick Screenshot](/screenshots_githubs/Oneirocom-Magick.jpg)
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.