zipnn
A Lossless Compression Library for AI pipelines
Stars: 151
ZipNN is a lossless and near-lossless compression library optimized for numbers/tensors in the Foundation Models environment. It automatically prepares data for compression based on its type, allowing users to focus on core tasks without worrying about compression complexities. The library delivers effective compression techniques for different data types and structures, achieving high compression ratios and rates. ZipNN supports various compression methods like ZSTD, lz4, and snappy, and provides ready-made scripts for file compression/decompression. Users can also manually import the package to compress and decompress data. The library offers advanced configuration options for customization and validation tests for different input and compression types.
README:
TL;DR - simple, fast, and effective model compression
Try out for yourself the compressed ibm-granite granite-7b-instruct model hosted on Hugging Face:
pip install zipnn
from transformers import AutoTokenizer, AutoModelForCausalLM
from zipnn import zipnn_hf
zipnn_hf()
tokenizer = AutoTokenizer.from_pretrained("royleibov/granite-7b-instruct-ZipNN-Compressed")
model = AutoModelForCausalLM.from_pretrained("royleibov/granite-7b-instruct-ZipNN-Compressed")
ZipNN also allows you to seamlessly save local disk space in your cache after the model is downloaded.
To compress the cached model, simply run:
python zipnn_compress_path.py safetensors --model royleibov/granite-7b-instruct-ZipNN-Compressed --hf_cache
The model will be decompressed automatically and safely as long as zipnn_hf() is called at the top of the file, as in the example above.
To decompress manually, simply run:
python zipnn_decompress_path.py --model royleibov/granite-7b-instruct-ZipNN-Compressed --hf_cache
You can try other state-of-the-art compressed models from the regularly updated list below:
You can also try one of these python notebooks hosted on Kaggle: granite 3b, Llama 3.2, phi 3.5.
Click here to explore other examples of compressed models hosted on Hugging Face
Click here to see full Hugging Face integration documentation
Download the scripts for compressing/decompressing AI Models:
wget -i https://raw.githubusercontent.com/zipnn/zipnn/main/scripts/scripts.txt
To compress a file:
python3 zipnn_compress_file.py model_name
To decompress a file:
python3 zipnn_decompress_file.py compressed_model_name.znn
In the realm of data compression, achieving a high compression/decompression ratio often requires careful consideration of the data types and the nature of the datasets being compressed. For instance, different strategies may be optimal for floating-point numbers compared to integers, and datasets in monotonic order may benefit from distinct preparations.
ZipNN (The NN stands for Neural Networks) is a lossless compression library optimized for numbers/tensors in the Foundation Models environment, designed to automatically prepare the data for compression according to its type. By simply calling zipnn.compress(data), users can rely on the package to apply the most effective compression technique under the hood.
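For instance, a minimal sketch of the byte-format path might look like the following; the file name weights.bin is hypothetical, and bytearray_dtype (documented in the configuration section below) tells ZipNN how to prepare the raw bytes:
from zipnn import ZipNN

# Hypothetical file holding raw float32 data.
zpn = ZipNN(method='zstd', input_format='byte', bytearray_dtype='float32')

with open('weights.bin', 'rb') as f:
    original_bytes = f.read()

compressed_bytes = zpn.compress(original_bytes)
restored_bytes = zpn.decompress(compressed_bytes)

print(restored_bytes == original_bytes)  # True: the round trip is lossless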
Click here to explore the options we use for different datasets and data types
Given a specific dataset, ZipNN automatically rearranges the data according to its type and applies the most effective techniques for the given instance to improve compression ratio and speed. It is especially effective for BF16 models, typically saving 33% of the model size, whereas with FP32 models it usually reduces the model size by about 17%.
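To see why this kind of byte grouping helps, here is a simplified, hedged illustration (not ZipNN's internal implementation, which uses its own C kernels and Huffman coding): it simulates bfloat16 data with numpy, groups the sign/exponent bytes separately from the mantissa bytes, and compresses both layouts with zstandard.
import numpy as np
import zstandard as zstd

# Simulate bfloat16 values by keeping the upper two bytes of float32 numbers
# (little-endian: byte 3 holds the sign and most exponent bits, byte 2 the rest).
rng = np.random.default_rng(0)
f32 = rng.standard_normal(1_000_000).astype(np.float32)
bf16_bytes = f32.view(np.uint8).reshape(-1, 4)[:, 2:4]

interleaved = bf16_bytes.tobytes()                                 # natural per-value layout
grouped = bf16_bytes[:, 1].tobytes() + bf16_bytes[:, 0].tobytes()  # exponent bytes, then mantissa bytes

cctx = zstd.ZstdCompressor()
# The exponent byte stream is highly repetitive, so the grouped layout
# typically compresses to a noticeably smaller ratio than the interleaved one.
print("interleaved:", len(cctx.compress(interleaved)) / len(interleaved))
print("grouped:    ", len(cctx.compress(grouped)) / len(grouped))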
Some of the techniques employed in ZipNN are described in our paper: Lossless and Near-Lossless Compression for Foundation Models. A follow-up version with a more complete description is in preparation.
Currently, ZipNN compression methods are implemented on CPUs, and GPU implementations are on the way.
Below is a comparison of compression results between ZipNN and several other methods on bfloat16 data.
Compressor name | Compression ratio / Output size | Compression Throughput | Decompression Throughput |
---|---|---|---|
ZipNN v0.2.0 | 1.51 / 66.3% | 1120MB/sec | 1660MB/sec |
ZSTD v1.5.6 | 1.27 / 78.3% | 785MB/sec | 950MB/sec |
LZ4 | 1 / 100% | --- | --- |
Snappy | 1 / 100% | --- | --- |
- Gzip and Zlib achieve compression ratios similar to ZSTD, but are much slower.
- The above results are for single-threaded compression (working with a chunk size of 256KB).
- Similar results hold for other BF16 models such as Mistral, Llama-3, Llama-3.1, Arcee-Nova, and Jamba.
pip install zipnn
git clone git@github.com:zipnn/zipnn.git
cd zipnn
We are using two submodules:
- Cyan4973/FiniteStateEntropy [https://github.com/Cyan4973/FiniteStateEntropy]
- facebook/zstd [https://github.com/facebook/zstd] tag 1.5.6
git submodule update --init --recursive
Compile locally using pip
pip install -e .
This project requires the following Python packages:
- numpy
- zstandard
- torch
You can integrate zipnn compression and decompression into your own projects by utilizing the scripts available in the scripts folder. This folder contains the following scripts:
- zipnn_compress_file.py: For compressing an individual file.
- zipnn_decompress_file.py: For decompressing an individual file.
- zipnn_compress_path.py: For compressing all files under a path.
- zipnn_decompress_path.py: For decompressing all files under a path.
Compress one file:
python zipnn_compress_file.py model_name
Decompress one file:
python zipnn_decompress_file.py model_name.znn
For detailed information on how to use these scripts, please refer to the README.md file located in the scripts folder.
You can use the package manually, like so:
Import zipnn:
from zipnn import ZipNN
Instance class:
zpn = ZipNN(method='zstd', input_format='torch')
Create a 1MB tensor with random numbers from a uniform distribution between -1 and 1. The dtype is bfloat16:
import torch
original_tensor = torch.rand(10124*1024, dtype=torch.bfloat16) * 2 - 1
Compression:
compressed_data = zpn.compress(original_tensor)
Decompression:
decompressed_data = zpn.decompress(compressed_data)
Check for correctness:
torch.equal(original_tensor, decompressed_data)
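Putting these steps together, a self-contained round-trip sketch looks like this (the final line assumes compress() returns a bytes-like buffer, so its length can be compared with the original tensor size):
import torch
from zipnn import ZipNN

zpn = ZipNN(method='zstd', input_format='torch')

# Random bfloat16 tensor with values drawn uniformly from [-1, 1).
original_tensor = torch.rand(1024 * 1024, dtype=torch.bfloat16) * 2 - 1

compressed_data = zpn.compress(original_tensor)
decompressed_data = zpn.decompress(compressed_data)

print(torch.equal(original_tensor, decompressed_data))  # True: lossless round trip
# Rough compressed/original size ratio (bfloat16 is 2 bytes per element).
print(len(compressed_data) / (original_tensor.numel() * 2))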
In this example, ZipNN and ZSTD compress and decompress 1GB of the Granite model and validate that the original file and the decompressed file are equal.
The script reads the file and compresses and decompresses in Byte format.
> python3 simple_example_granite.py
...
Are the original and decompressed byte strings the same [BYTE]? True
In this example, ZipNN compresses a full model hosted on the Hugging Face AI-Hub.
From the model's directory (which can be forked and cloned locally; make sure you run git lfs pull upstream before continuing), run:
python3 zipnn_compress_path.py safetensors --path .
Add the compressed weights to git-lfs tracking:
git lfs track "*.znn" &&
sed -i 's/.safetensors/.safetensors.znn/g' model.safetensors.index.json &&
git add *.znn .gitattributes model.safetensors.index.json &&
git rm *.safetensors
Done! Now push the changes as per the documentation.
To use the model simply run our ZipNN Hugging Face method before proceeding as normal:
from zipnn import zipnn_hf
zipnn_hf()
# Load the model from your compressed Hugging Face model card as you normally would
...
You can test Jamba-v0.1-ZipNN-Compressed and granite-7b-instruct-ZipNN-Compressed yourself (both compressed to 67% of their original sizes, which could save ~1PB of monthly downloads for ai21labs Jamba-v0.1 and ~30TB for ibm-granite granite-7b-instruct).
The default configuration is byte grouping of 4 with vanilla ZSTD (running with 8 threads), and the input and output formats are "byte". For more advanced options, please consider the following parameters (a configuration sketch follows the list):
- method: Compression method. Supports zstd, lz4, and snappy (default value = 'zstd').
- input_format: The input data format; one of torch, numpy, or byte (default value = 'byte').
- bytearray_dtype: The data type of the byte array when input_format is 'byte'. If input_format is torch or numpy, the dtype is derived from the data automatically (default value = 'float32').
- threads: The maximum number of threads for compression and bit manipulation. If 0, the code decides according to the dataset length (default value = 1).
- compression_threshold: Keep the original buffer if compression does not beat this threshold (default value = 0.95).
- check_th_after_percent: Check the compression threshold after this percentage of the chunks and stop compressing if compression_threshold is not met (default value = 10[%]).
- byte_reorder: Number of byte groups, encoded in the following bit format:
  - [7] - Group 0/1: 4th byte
  - [6-5] - Group 0/1/2: 3rd byte
  - [4-3] - Group 0/1/2/3: 2nd byte
  - [2-0] - Group 0/1/2/3/4: 1st byte
  Examples:
  - bg16: two groups - 0_00_01_010 (decimal 10)
  - fp32: four groups - 1_10_11_100 (decimal 220)
  - int32: truncate two MSBs - 0_00_01_001 (decimal 9)
- reorder_signbit: Controls the reordering of the sign bit for float32 or bfloat16 to improve compression. Options are:
  - 255: No reordering of the sign bit.
  - 16: Reorders the sign bit for bfloat16.
  - 32: Reorders the sign bit for float32.
  - 0: Automatically decides based on the data type (default value = 0).
- compression_chunk: Chunk size for compression (default value = 256KB).
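As a hedged sketch of how several of these parameters can be combined (the values are illustrative, not tuned recommendations, and compression_chunk is assumed to be given in bytes):
from zipnn import ZipNN

zpn = ZipNN(
    method='zstd',                 # backend: 'zstd', 'lz4', or 'snappy'
    input_format='torch',          # dtype is derived from the tensor automatically
    threads=8,                     # cap on compression / bit-manipulation threads
    compression_threshold=0.95,    # keep the original buffer if compression does not beat 0.95
    check_th_after_percent=10,     # evaluate the threshold after 10% of the chunks
    reorder_signbit=0,             # 0 = decide automatically from the data type
    compression_chunk=256 * 1024,  # 256KB chunks (assumed to be specified in bytes)
)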
Click here to explore additional ZipNN configuration options
Run tests for Byte/File input types, Byte/File compression types, Byte/File decompression types.
python3 -m unittest discover -s tests/ -p test_suit.py
We are excited to hear your feedback!
For issues and feature requests, please open a GitHub issue.
We welcome and value all contributions to the project! You can contact us at this email: [email protected]
- Add float32 to the C implementation with Huffman compression.
- Plugin for Hugging Face transformers to allow using from_pretrained and decompressing the model after downloading it from Hugging Face.
- Add Delta compression support in Python: save the XOR between two models and compress it.
- Change the ZipNN suffix from .zpn to .znn.
- Prepare dtype16 (BF16 and FP16) for multi-threading by changing its C logic. For each chunk, byte ordering, bit ordering, and compression are processed separately.
- Integrate the streaming support into the zipnn Python code.
- Add support for streaming when using the outside scripts.
- Fix bug: compression didn't work when compressing files larger than 3GB.
- Change the byte ordering implementation to C (for better performance).
- Change the bfloat16/float16 implementation to a C implementation with Huffman encoding, running on chunks of 256KB each.
- Float32 using ZSTD compression, as in v0.1.1.
- Add support for uint32 with ZSTD compression.
- Python implementation of compressing models: float32, float16, and bfloat16 with byte ordering and ZSTD.
@article{hershcovitch2024lossless,
title={Lossless and Near-Lossless Compression for Foundation Models},
author={Hershcovitch, Moshik and Choshen, Leshem and Wood, Andrew and Enmouri, Ilias and Chin, Peter and Sundararaman, Swaminathan and Harnik, Danny},
journal={arXiv preprint arXiv:2404.15198},
year={2024}
}
Alternative AI tools for zipnn
Similar Open Source Tools
LLM-Pruner
LLM-Pruner is a tool for structural pruning of large language models, allowing task-agnostic compression while retaining multi-task solving ability. It supports automatic structural pruning of various LLMs with minimal human effort. The tool is efficient, requiring only 3 minutes for pruning and 3 hours for post-training. Supported LLMs include Llama-3.1, Llama-3, Llama-2, LLaMA, BLOOM, Vicuna, and Baichuan. Updates include support for new LLMs like GQA and BLOOM, as well as fine-tuning results achieving high accuracy. The tool provides step-by-step instructions for pruning, post-training, and evaluation, along with a Gradio interface for text generation. Limitations include issues with generating repetitive or nonsensical tokens in compressed models and manual operations for certain models.
paxml
Pax is a framework to configure and run machine learning experiments on top of Jax.
FlexFlow
FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.
camel
CAMEL is an open-source library designed for the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we implement and support various types of agents, tasks, prompts, models, and simulated environments.
evalverse
Evalverse is an open-source project designed to support Large Language Model (LLM) evaluation needs. It provides a standardized and user-friendly solution for processing and managing LLM evaluations, catering to AI research engineers and scientists. Evalverse supports various evaluation methods, insightful reports, and no-code evaluation processes. Users can access unified evaluation with submodules, request evaluations without code via Slack bot, and obtain comprehensive reports with scores, rankings, and visuals. The tool allows for easy comparison of scores across different models and swift addition of new evaluation tools.
OpenAdapt
OpenAdapt is an open-source software adapter between Large Multimodal Models (LMMs) and traditional desktop and web Graphical User Interfaces (GUIs). It aims to automate repetitive GUI workflows by leveraging the power of LMMs. OpenAdapt records user input and screenshots, converts them into tokenized format, and generates synthetic input via transformer model completions. It also analyzes recordings to generate task trees and replay synthetic input to complete tasks. OpenAdapt is model agnostic and generates prompts automatically by learning from human demonstration, ensuring that agents are grounded in existing processes and mitigating hallucinations. It works with all types of desktop GUIs, including virtualized and web, and is open source under the MIT license.
wanda
Official PyTorch implementation of Wanda (Pruning by Weights and Activations), a simple and effective pruning approach for large language models. The pruning approach removes weights on a per-output basis, by the product of weight magnitudes and input activation norms. The repository provides support for various features such as LLaMA-2, ablation study on OBS weight update, zero-shot evaluation, and speedup evaluation. Users can replicate main results from the paper using provided bash commands. The tool aims to enhance the efficiency and performance of language models through structured and unstructured sparsity techniques.
llmgraph
llmgraph is a tool that enables users to create knowledge graphs in GraphML, GEXF, and HTML formats by extracting world knowledge from large language models (LLMs) like ChatGPT. It supports various entity types and relationships, offers cache support for efficient graph growth, and provides insights into LLM costs. Users can customize the model used and interact with different LLM providers. The tool allows users to generate interactive graphs based on a specified entity type and Wikipedia link, making it a valuable resource for knowledge graph creation and exploration.
stable-diffusion.cpp
The stable-diffusion.cpp repository provides an implementation for inferring stable diffusion in pure C/C++. It offers features such as support for different versions of stable diffusion, lightweight and dependency-free implementation, various quantization support, memory-efficient CPU inference, GPU acceleration, and more. Users can download the built executable program or build it manually. The repository also includes instructions for downloading weights, building from scratch, using different acceleration methods, running the tool, converting weights, and utilizing various features like Flash Attention, ESRGAN upscaling, PhotoMaker support, and more. Additionally, it mentions future TODOs and provides information on memory requirements, bindings, UIs, contributors, and references.
BentoML
BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with everything you need for serving optimization, model packaging, and production deployment.
Fira
Fira is a memory-efficient training framework for Large Language Models (LLMs) that enables full-rank training under low-rank constraint. It introduces a method for training with full-rank gradients of full-rank weights, achieved with just two lines of equations. The framework includes pre-training and fine-tuning functionalities, packaged as a Python library for easy use. Fira utilizes Adam optimizer by default and provides options for weight decay. It supports pre-training LLaMA models on the C4 dataset and fine-tuning LLaMA-7B models on commonsense reasoning tasks.
llm-analysis
llm-analysis is a tool designed for Latency and Memory Analysis of Transformer Models for Training and Inference. It automates the calculation of training or inference latency and memory usage for Large Language Models (LLMs) or Transformers based on specified model, GPU, data type, and parallelism configurations. The tool helps users to experiment with different setups theoretically, understand system performance, and optimize training/inference scenarios. It supports various parallelism schemes, communication methods, activation recomputation options, data types, and fine-tuning strategies. Users can integrate llm-analysis in their code using the `LLMAnalysis` class or use the provided entry point functions for command line interface. The tool provides lower-bound estimations of memory usage and latency, and aims to assist in achieving feasible and optimal setups for training or inference.
chatllm.cpp
ChatLLM.cpp is a pure C++ implementation tool for real-time chatting with RAG on your computer. It supports inference of various models ranging from less than 1B to more than 300B. The tool provides accelerated memory-efficient CPU inference with quantization, optimized KV cache, and parallel computing. It allows streaming generation with a typewriter effect and continuous chatting with virtually unlimited content length. ChatLLM.cpp also offers features like Retrieval Augmented Generation (RAG), LoRA, Python/JavaScript/C bindings, web demo, and more possibilities. Users can clone the repository, quantize models, build the project using make or CMake, and run quantized models for interactive chatting.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.