data:image/s3,"s3://crabby-images/74c83/74c83df2ebf176f02fdd6a78b77f5efae33d2d47" alt="SeerAttention"
SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Stars: 66
data:image/s3,"s3://crabby-images/bc8de/bc8de9ec086ca72a1cae39b987afc33335726504" alt="screenshot"
SeerAttention is a novel trainable sparse attention mechanism that learns intrinsic sparsity patterns directly from LLMs through self-distillation at post-training time. It achieves faster inference while maintaining accuracy for long-context prefilling. The tool offers features such as trainable sparse attention, block-level sparsity, self-distillation, efficient kernel, and easy integration with existing transformer architectures. Users can quickly start using SeerAttention for inference with AttnGate Adapter and training attention gates with self-distillation. The tool provides efficient evaluation methods and encourages contributions from the community.
README:
Official implementation of SeerAttention - a novel trainable sparse attention mechanism that learns intrinsic sparsity patterns directly from LLMs through self-distillation at post-training time. Achieves faster inference while maintaining accuracy for long-context prefilling.
- 2025/2/23: Support Qwen! Change the distillation into model adapter so that only AttnGates are saved.
- 2025/2/18: Deepseek's Native Sparse Attention (NSA) and Kimi's Mixture of Block Attention (MoBA) all aquire similar trainable sparse attention concepts as us for pretrain models. Great works!
Trainable Sparse Attention - Outperform static/predefined attention sparsity
Block-level Sparsity - Hardware efficient sparsity at block level
Self-Distillation - Lightweight training of attention gates (original weights frozen)
Efficient Kernel - Block-sparse FlashAttention implementation
Easy Integration - Works with existing transformer architectures
The current codebase is improved by only saving the distilled AttnGates' weights. During inference, you can composed the AttnGates and original base model. Check the latest huggingface repos!
Base Model | HF Link | AttnGates Size |
---|---|---|
Llama-3.1-8B-Instruct | SeerAttention/SeerAttention-Llama-3.1-8B-AttnGates | 101 MB |
Llama-3.1-70B-Instruct | SeerAttention/SeerAttention-Llama-3.1-70B-AttnGates | 503 MB |
Qwen2.5-7B-Instruct | SeerAttention/SeerAttention-Qwen2.5-7B-AttnGates | 77 MB |
Qwen2.5-14B-Instruct | SeerAttention/SeerAttention-Qwen2.5-14B-AttnGates | 189 MB |
Qwen2.5-32B-Instruct | SeerAttention/SeerAttention-Qwen2.5-32B-AttnGates | 252 MB |
conda create -yn seer python=3.11
conda activate seer
pip install torch==2.4.0
pip install -r requirements.txt
pip install -e .
During inference, we automatically compose your original base model with our distilled AttnGates.
SeerAttention supports two sparse methods (Threshold / TopK) to convert a soft gating score to hard binary attention mask. Currently we simply use a single sparse configuration for all the attention heads. You are encourage to explore other configurations to tradeoff the speedup vs quality.
from transformers import AutoTokenizer, AutoConfig
from seer_attn import SeerAttnLlamaForCausalLM
model_name = "SeerAttention/SeerAttention-Llama-3.1-8B-AttnGates"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(
config.base_model,
padding_side="left",
)
## This will compose the AttnGates and base model
model = SeerAttnLlamaForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
seerattn_sparsity_method='threshold', # Using a threshold based sparse method
seerattn_threshold = 5e-4, # Higher = sparser, typical range 5e-4 ~ 5e-3
)
# Or using a TopK based sparse method
model = SeerAttnLlamaForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
seerattn_sparsity_method='nz_ratio',
seerattn_nz_ratio = 0.5, # Lower = sparser, typical range 0.1 ~ 0.9
)
model = model.cuda()
# Ready to inference
Only AttnGates are trained to mimic the block-level attention score. In other words, the original model's weights are fronzen.
## scirpts to reproduce llama-3.1-8b
bash run_distillation.sh
The core idea of self-distillation training is to use the 2d-maxpooled attention map from original model to train an AttnGate. We provide an efficient kernel to directly output this ground truth.
### simple pseudo codo for self-distillation AttnGate training
from seer_attn.attn_pooling_kernel import attn_with_pooling
predict_mask = attn_gate(...)
attn_output, mask_ground_truth = attn_with_pooling(
query_states,
key_states,
value_states,
is_causal,
sm_scale,
block_size
)
###...
loss = self.loss_func(predict_mask, mask_ground_truth)
For efficiency, we evaluate block_sparse_attn
compared with full attention by FlashAttention-2.
For model accuracy, we evaluate SeerAttention on PG19, Ruler and LongBench. Please refer to eval
folder for details.
If you find SeerAttention useful or want to use in your projects, please kindly cite our paper:
@article{gao2024seerattention,
title={SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs},
author={Gao, Yizhao and Zeng, Zhichen and Du, Dayou and Cao, Shijie and So, Hayden Kwok-Hay and Cao, Ting and Yang, Fan and Yang, Mao},
journal={arXiv preprint arXiv:2410.13276},
year={2024}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for SeerAttention
Similar Open Source Tools
data:image/s3,"s3://crabby-images/bc8de/bc8de9ec086ca72a1cae39b987afc33335726504" alt="SeerAttention Screenshot"
SeerAttention
SeerAttention is a novel trainable sparse attention mechanism that learns intrinsic sparsity patterns directly from LLMs through self-distillation at post-training time. It achieves faster inference while maintaining accuracy for long-context prefilling. The tool offers features such as trainable sparse attention, block-level sparsity, self-distillation, efficient kernel, and easy integration with existing transformer architectures. Users can quickly start using SeerAttention for inference with AttnGate Adapter and training attention gates with self-distillation. The tool provides efficient evaluation methods and encourages contributions from the community.
data:image/s3,"s3://crabby-images/7127e/7127ee7205a1505290fabcef9143d2446e2678d8" alt="TokenFormer Screenshot"
TokenFormer
TokenFormer is a fully attention-based neural network architecture that leverages tokenized model parameters to enhance architectural flexibility. It aims to maximize the flexibility of neural networks by unifying token-token and token-parameter interactions through the attention mechanism. The architecture allows for incremental model scaling and has shown promising results in language modeling and visual modeling tasks. The codebase is clean, concise, easily readable, state-of-the-art, and relies on minimal dependencies.
data:image/s3,"s3://crabby-images/9885a/9885a51e3bdcd5e5cb91e870b3f472f404dad300" alt="flashinfer Screenshot"
flashinfer
FlashInfer is a library for Language Languages Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focus on LLM serving and inference, and delivers state-the-art performance across diverse scenarios.
data:image/s3,"s3://crabby-images/e1328/e13284ff422f8de07215c1db76772722c5c2c5fc" alt="LLM-Pruner Screenshot"
LLM-Pruner
LLM-Pruner is a tool for structural pruning of large language models, allowing task-agnostic compression while retaining multi-task solving ability. It supports automatic structural pruning of various LLMs with minimal human effort. The tool is efficient, requiring only 3 minutes for pruning and 3 hours for post-training. Supported LLMs include Llama-3.1, Llama-3, Llama-2, LLaMA, BLOOM, Vicuna, and Baichuan. Updates include support for new LLMs like GQA and BLOOM, as well as fine-tuning results achieving high accuracy. The tool provides step-by-step instructions for pruning, post-training, and evaluation, along with a Gradio interface for text generation. Limitations include issues with generating repetitive or nonsensical tokens in compressed models and manual operations for certain models.
data:image/s3,"s3://crabby-images/36ab0/36ab06e3d59e6fdf95bab5cf013eec9f0f5864b4" alt="raga-llm-hub Screenshot"
raga-llm-hub
Raga LLM Hub is a comprehensive evaluation toolkit for Language and Learning Models (LLMs) with over 100 meticulously designed metrics. It allows developers and organizations to evaluate and compare LLMs effectively, establishing guardrails for LLMs and Retrieval Augmented Generation (RAG) applications. The platform assesses aspects like Relevance & Understanding, Content Quality, Hallucination, Safety & Bias, Context Relevance, Guardrails, and Vulnerability scanning, along with Metric-Based Tests for quantitative analysis. It helps teams identify and fix issues throughout the LLM lifecycle, revolutionizing reliability and trustworthiness.
data:image/s3,"s3://crabby-images/9b259/9b2594a1d1c12d8b44969470f62bda7622d994b9" alt="MInference Screenshot"
MInference
MInference is a tool designed to accelerate pre-filling for long-context Language Models (LLMs) by leveraging dynamic sparse attention. It achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy. The tool supports various decoding LLMs, including LLaMA-style models and Phi models, and provides custom kernels for attention computation. MInference is useful for researchers and developers working with large-scale language models who aim to improve efficiency without compromising accuracy.
data:image/s3,"s3://crabby-images/7828b/7828bcc900c4be53ec94f68503aecf1814af3845" alt="LMCache Screenshot"
LMCache
LMCache is a serving engine extension designed to reduce time to first token (TTFT) and increase throughput, particularly in long-context scenarios. It stores key-value caches of reusable texts across different locations like GPU, CPU DRAM, and Local Disk, allowing the reuse of any text in any serving engine instance. By combining LMCache with vLLM, significant delay savings and GPU cycle reduction are achieved in various large language model (LLM) use cases, such as multi-round question answering and retrieval-augmented generation (RAG). LMCache provides integration with the latest vLLM version, offering both online serving and offline inference capabilities. It supports sharing key-value caches across multiple vLLM instances and aims to provide stable support for non-prefix key-value caches along with user and developer documentation.
data:image/s3,"s3://crabby-images/2e502/2e50297696c73b93ec69c82a7276b1a99f267426" alt="NeMo-Curator Screenshot"
NeMo-Curator
NeMo Curator is a GPU-accelerated open-source framework designed for efficient large language model data curation. It provides scalable dataset preparation for tasks like foundation model pretraining, domain-adaptive pretraining, supervised fine-tuning, and parameter-efficient fine-tuning. The library leverages GPUs with Dask and RAPIDS to accelerate data curation, offering customizable and modular interfaces for pipeline expansion and model convergence. Key features include data download, text extraction, quality filtering, deduplication, downstream-task decontamination, distributed data classification, and PII redaction. NeMo Curator is suitable for curating high-quality datasets for large language model training.
data:image/s3,"s3://crabby-images/8a7af/8a7af5c883a44f07a34b5f36956ab9f1283ab309" alt="aimo-progress-prize Screenshot"
aimo-progress-prize
This repository contains the training and inference code needed to replicate the winning solution to the AI Mathematical Olympiad - Progress Prize 1. It consists of fine-tuning DeepSeekMath-Base 7B, high-quality training datasets, a self-consistency decoding algorithm, and carefully chosen validation sets. The training methodology involves Chain of Thought (CoT) and Tool Integrated Reasoning (TIR) training stages. Two datasets, NuminaMath-CoT and NuminaMath-TIR, were used to fine-tune the models. The models were trained using open-source libraries like TRL, PyTorch, vLLM, and DeepSpeed. Post-training quantization to 8-bit precision was done to improve performance on Kaggle's T4 GPUs. The project structure includes scripts for training, quantization, and inference, along with necessary installation instructions and hardware/software specifications.
data:image/s3,"s3://crabby-images/a7fba/a7fbad0cf421d9d3b3f8f91291ec97b916dbe381" alt="RLAIF-V Screenshot"
RLAIF-V
RLAIF-V is a novel framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. It maximally exploits open-source feedback from high-quality feedback data and online feedback learning algorithm. Notable features include achieving super GPT-4V trustworthiness in both generative and discriminative tasks, using high-quality generalizable feedback data to reduce hallucination of different MLLMs, and exhibiting better learning efficiency and higher performance through iterative alignment.
data:image/s3,"s3://crabby-images/2b52a/2b52a7eda1b874c746d18ae3c08bf1697e5fc6e7" alt="OpenAdapt Screenshot"
OpenAdapt
OpenAdapt is an open-source software adapter between Large Multimodal Models (LMMs) and traditional desktop and web Graphical User Interfaces (GUIs). It aims to automate repetitive GUI workflows by leveraging the power of LMMs. OpenAdapt records user input and screenshots, converts them into tokenized format, and generates synthetic input via transformer model completions. It also analyzes recordings to generate task trees and replay synthetic input to complete tasks. OpenAdapt is model agnostic and generates prompts automatically by learning from human demonstration, ensuring that agents are grounded in existing processes and mitigating hallucinations. It works with all types of desktop GUIs, including virtualized and web, and is open source under the MIT license.
data:image/s3,"s3://crabby-images/6f609/6f609cf8bd5438100a511740d61eb716efe14d06" alt="FATE-LLM Screenshot"
FATE-LLM
FATE-LLM is a framework supporting federated learning for large and small language models. It promotes training efficiency of federated LLMs using Parameter-Efficient methods, protects the IP of LLMs using FedIPR, and ensures data privacy during training and inference through privacy-preserving mechanisms.
data:image/s3,"s3://crabby-images/1b588/1b5885a06b4a445b903a44f47f220168d4681852" alt="habitat-lab Screenshot"
habitat-lab
Habitat-Lab is a modular high-level library for end-to-end development in embodied AI. It is designed to train agents to perform a wide variety of embodied AI tasks in indoor environments, as well as develop agents that can interact with humans in performing these tasks.
data:image/s3,"s3://crabby-images/61b71/61b71614a9fa55d572c4d79000d565869bb1acb6" alt="axoned Screenshot"
axoned
Axone is a public dPoS layer 1 designed for connecting, sharing, and monetizing resources in the AI stack. It is an open network for collaborative AI workflow management compatible with any data, model, or infrastructure, allowing sharing of data, algorithms, storage, compute, APIs, both on-chain and off-chain. The 'axoned' node of the AXONE network is built on Cosmos SDK & Tendermint consensus, enabling companies & individuals to define on-chain rules, share off-chain resources, and create new applications. Validators secure the network by maintaining uptime and staking $AXONE for rewards. The blockchain supports various platforms and follows Semantic Versioning 2.0.0. A docker image is available for quick start, with documentation on querying networks, creating wallets, starting nodes, and joining networks. Development involves Go and Cosmos SDK, with smart contracts deployed on the AXONE blockchain. The project provides a Makefile for building, installing, linting, and testing. Community involvement is encouraged through Discord, open issues, and pull requests.
data:image/s3,"s3://crabby-images/38340/38340afd630767fad46982c17969509f9428513a" alt="OpenLLM Screenshot"
OpenLLM
OpenLLM is a platform that helps developers run any open-source Large Language Models (LLMs) as OpenAI-compatible API endpoints, locally and in the cloud. It supports a wide range of LLMs, provides state-of-the-art serving and inference performance, and simplifies cloud deployment via BentoML. Users can fine-tune, serve, deploy, and monitor any LLMs with ease using OpenLLM. The platform also supports various quantization techniques, serving fine-tuning layers, and multiple runtime implementations. OpenLLM seamlessly integrates with other tools like OpenAI Compatible Endpoints, LlamaIndex, LangChain, and Transformers Agents. It offers deployment options through Docker containers, BentoCloud, and provides a community for collaboration and contributions.
data:image/s3,"s3://crabby-images/fd697/fd697910a72ec8f9103988d44be285db9ba632d4" alt="nous Screenshot"
nous
Nous is an open-source TypeScript platform for autonomous AI agents and LLM based workflows. It aims to automate processes, support requests, review code, assist with refactorings, and more. The platform supports various integrations, multiple LLMs/services, CLI and web interface, human-in-the-loop interactions, flexible deployment options, observability with OpenTelemetry tracing, and specific agents for code editing, software engineering, and code review. It offers advanced features like reasoning/planning, memory and function call history, hierarchical task decomposition, and control-loop function calling options. Nous is designed to be a flexible platform for the TypeScript community to expand and support different use cases and integrations.
For similar tasks
data:image/s3,"s3://crabby-images/bc8de/bc8de9ec086ca72a1cae39b987afc33335726504" alt="SeerAttention Screenshot"
SeerAttention
SeerAttention is a novel trainable sparse attention mechanism that learns intrinsic sparsity patterns directly from LLMs through self-distillation at post-training time. It achieves faster inference while maintaining accuracy for long-context prefilling. The tool offers features such as trainable sparse attention, block-level sparsity, self-distillation, efficient kernel, and easy integration with existing transformer architectures. Users can quickly start using SeerAttention for inference with AttnGate Adapter and training attention gates with self-distillation. The tool provides efficient evaluation methods and encourages contributions from the community.
data:image/s3,"s3://crabby-images/1c63a/1c63a828509b61dad6be09e0b76e635061ea3454" alt="matmulfreellm Screenshot"
matmulfreellm
MatMul-Free LM is a language model architecture that eliminates the need for Matrix Multiplication (MatMul) operations. This repository provides an implementation of MatMul-Free LM that is compatible with the 🤗 Transformers library. It evaluates how the scaling law fits to different parameter models and compares the efficiency of the architecture in leveraging additional compute to improve performance. The repo includes pre-trained models, model implementations compatible with 🤗 Transformers library, and generation examples for text using the 🤗 text generation APIs.
For similar jobs
data:image/s3,"s3://crabby-images/7a828/7a828889d979cbf4be5a04454f679734bb36585f" alt="sweep Screenshot"
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
data:image/s3,"s3://crabby-images/cac11/cac1100b7e92d3c9c9529eacfe5a6e8d943d8f57" alt="teams-ai Screenshot"
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
data:image/s3,"s3://crabby-images/10f6b/10f6b939c21eecaacb4aeb678159f5a587a20256" alt="ai-guide Screenshot"
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
data:image/s3,"s3://crabby-images/8b8c3/8b8c30180bcfba25fde40a102b6ae98fd35704b8" alt="classifai Screenshot"
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
data:image/s3,"s3://crabby-images/c6b52/c6b52a0438e707c19f9dcb358608627496141f31" alt="chatbot-ui Screenshot"
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
data:image/s3,"s3://crabby-images/2fa15/2fa15d62e208bea0a119405a82ad37a6b24564c0" alt="BricksLLM Screenshot"
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students
data:image/s3,"s3://crabby-images/e597e/e597e24a3c2657c376591c1e0da9159b22cd2ff2" alt="uAgents Screenshot"
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
data:image/s3,"s3://crabby-images/8ab69/8ab692a869eef895ffca840dda9b43d13f3cf958" alt="griptape Screenshot"
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.