awesome-ai-efficiency

A curated list of materials on AI efficiency

Stars: 115

Awesome AI Efficiency is a curated list of resources dedicated to enhancing efficiency in AI systems. The repository covers various topics essential for optimizing AI models and processes, aiming to make AI faster, cheaper, smaller, and greener. It includes topics like quantization, pruning, caching, distillation, factorization, compilation, parameter-efficient fine-tuning, speculative decoding, hardware optimization, training techniques, inference optimization, sustainability strategies, and scalability approaches.

README:

🌟 Awesome AI Efficiency 🌟

A curated list of resources dedicated to enhancing efficiency in AI systems. This repository covers a wide range of topics essential for optimizing AI models and processes, aiming to make AI faster, cheaper, smaller, and greener!

Topics Summary 🎨

Topic	Description
Quantization	Reducing the numerical precision of model weights and activations with minimal loss of accuracy
Pruning	Removing unnecessary model parameters for efficiency
Caching	Storing computation results for faster reuse
Distillation	Transferring knowledge from a large model to a smaller one
Factorization	Breaking down complex models into simpler, efficient components
Compilation	Optimizing model code for specific hardware and environments
Parameter-Efficient Fine-tuning	Fine-tuning only a small subset of parameters
Speculative Decoding	Drafting several tokens cheaply, then verifying them in a single batched pass of the target model
Hardware	Leveraging specialized hardware for faster model execution
Training	Techniques for making model training faster and more efficient
Inference	Optimizing the speed and resource usage during model inference
Sustainability	Strategies to reduce the environmental impact of AI systems
Scalability	Approaches for scaling AI models and infrastructure efficiently
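
To make two rows of the table concrete, here is a minimal pure-Python sketch of quantization and pruning on a toy weight list. The function names (`quantize_int8`, `dequantize`, `magnitude_prune`) are hypothetical and are not taken from any tool listed here; real libraries work on tensors and use more refined schemes. Symmetric per-tensor rounding and magnitude-based pruning are assumed only because they are the simplest variants to show.

```python
# Illustrative toy versions of quantization and pruning.
# All names are hypothetical; this is a sketch, not a library API.

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in quantized]


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.02, -1.27, 0.64, 0.0, 0.9]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

pruned = magnitude_prune(weights, sparsity=0.4)
```

The quantized list stores one small integer per weight plus a single shared scale, which is where the memory saving comes from; the pruned list keeps the original precision but contains zeros that a sparse format or kernel could skip.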

If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!

Facts 📊
Tools 🛠️
News Articles 📰
Reports 📈
Research Articles 📄
Blogs 📰
Books 📚
Lectures 🎓
People 🧑‍💻
Organizations 🌍
Contributing 🤝
License 📄

Facts 📊

3-40Wh: Amount of energy consumed for one small to long ChatGPT query (Source, 2025)
1L: Estimated amount of water required for 20-100 ChatGPT queries (Source, 2025)
2 nuclear plants: Number of nuclear plants that would need to run constantly to generate enough energy if 80M people each generated 5 pages per day (Source, 2025)
1 smartphone charge: Amount of energy required to generate a couple of images with AI or to run a few thousand inferences with an LLM (Source, 2024)
>10s: Time required to generate 1 HD image with Flux on an H100 or to generate 100 tokens with Llama 3 on a T4 (Source and Source, 2024)
7-10 smartphone charges: Amount of energy required to generate one video with Wan 2.1 (Source)
61,848.0x: Ratio between the highest and lowest energy use in the energy leaderboard for AI models (Source, 2025).
1,300MWh: Training GPT-3 is estimated to have used just under 1,300 megawatt-hours (MWh) of electricity, about as much as 130 US homes consume annually (Source, 2024)
800M users/week: Number of users using ChatGPT per week in 2025 (Source)
1B messages/day: Number of ChatGPT queries per day in 2025 (Source)
+160%: Expected increase of data center power consumption by 2030 (Source)
x3.8: Hardware acceleration (GPU/TPU) reduces energy consumption by a factor of 3.8 compared with a CPU for the same task, and also reduces response time by up to 39% (Source)
x18: The carbon footprint of a task can vary by a factor of 18 depending on the model, framework and backend used (Source)
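
Some of the figures above can be combined into rough orders of magnitude. The sketch below uses only numbers quoted in this list; the variable and function names are hypothetical, and multiplying the per-query estimate by the daily message count assumes every message costs between 3 and 40 Wh, which the sources do not claim. Treat the results as back-of-envelope scale checks, not measurements.

```python
# Back-of-envelope arithmetic using only figures quoted in the list above.
# Assumption: each of the 1B daily messages costs between 3 and 40 Wh.

WH_PER_QUERY_LOW = 3                # Wh for a short ChatGPT query
WH_PER_QUERY_HIGH = 40              # Wh for a long ChatGPT query
MESSAGES_PER_DAY = 1_000_000_000    # 1B messages/day


def daily_energy_gwh(wh_per_query, queries_per_day):
    """Total energy per day in GWh (1 GWh = 1e9 Wh)."""
    return wh_per_query * queries_per_day / 1e9


low_gwh = daily_energy_gwh(WH_PER_QUERY_LOW, MESSAGES_PER_DAY)
high_gwh = daily_energy_gwh(WH_PER_QUERY_HIGH, MESSAGES_PER_DAY)

# GPT-3 figure: 1,300 MWh is compared with the annual use of 130 US homes,
# which implies this much electricity per home per year.
GPT3_MWH = 1_300
US_HOMES = 130
mwh_per_home_per_year = GPT3_MWH / US_HOMES

# Water figure: 1 L per 20-100 queries, expressed in millilitres per query.
ml_per_query_high = 1000 / 20
ml_per_query_low = 1000 / 100

print(f"Daily energy: {low_gwh:.0f}-{high_gwh:.0f} GWh")
print(f"Implied household use: {mwh_per_home_per_year:.0f} MWh/year")
print(f"Water per query: {ml_per_query_low:.0f}-{ml_per_query_high:.0f} mL")
```

Under these assumptions the daily total lands between 3 and 40 GWh, the household comparison implies about 10 MWh per US home per year, and the water estimate works out to 10-50 mL per query.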

Tools 🛠️

❤️ Pruna ❤️: A package to make AI models faster, smaller, cheaper, greener by combining compression methods (incl. quantization, pruning, caching, compilation, distillation...) on various hardware.
TensorRT: High-performance deep learning inference library for NVIDIA GPUs.
ONNX: Open Neural Network Exchange format for interoperability among deep learning frameworks.
Code Carbon: Library to track energy and carbon efficiency of various hardware.
LLM Perf: A framework for benchmarking the performance of transformers models with different hardware, backends and optimizations.
ML.ENERGY Leaderboard: An initiative to benchmark energy efficiency of AI models.
AI Energy Score: An initiative to establish comparable energy efficiency ratings for AI models, helping the industry make informed decisions about sustainability in AI development.
Model Optimization Toolkit: TensorFlow toolkit for optimizing machine learning models for deployment and execution.
Green Coding: LLM service that you can use to prompt most open source models and see the resource usage.
EcoLogits: A Python library that tracks the energy consumption and environmental footprint of using generative AI models through APIs.
Perplexity Kernels: GPU kernels by Perplexity.
Fast Tokenizer: An efficient and optimized tokenizer engine for LLM inference serving.
WeightWatcher: An open-source diagnostic tool for analyzing Deep Neural Networks (DNN) without needing access to training or even test data.
Cockpit: A Practical Debugging Tool for Training Deep Neural Networks.
Electricity Map: A live map showing the origin of the electricity in world regions and their CO2 intensity.
MLCA: A tool for machine learning life cycle assessment.
TritonParse: A visualization and analysis tool for Triton IR files, designed to help developers analyze, debug, and understand Triton kernel compilation processes.
Routing on Random Forests: A framework for training and serving LLMs behind random-forest-based routers, which makes it possible to optimize for cost.
LLMCache: An LLM serving engine extension to reduce time-to-first-token and increase throughput, especially under long-context scenarios.
ExLlamaV3: An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs.
FlashDeBERTa: Flash implementation of DeBERTa disentangled attention mechanism.
QuACK: An assortment of kernels for GPUs.
Pi-Quant: An assortment of kernels for CPUs.
pplx-kernels: An assortment of kernels for GPUs.
LMCache: An LLM serving engine extension to reduce TTFT and increase throughput, especially under long-context scenarios, by optimizing the KV caches.
FastWan: A family of video generation models trained via “sparse distillation”.
GEAK Agent: An LLM-based multi-agent framework that can automatically generate functional and efficient GPU kernels.
Fused Kernel Library: A package that lets users define GPU kernel fusion without being CUDA programmers.

News Articles 📰

"Energy and AI Observatory" (2025) - IEA
"AI’s Impacts, how to limit them, and why" (2025) - Better Tech
"How much energy does ChatGPT use?" (2025) - Epoch AI
"Data centers et intelligence artificielle : la course au gigantisme" (2025) - Le Monde
"What's the environmental cost of AI?" (2024) - CO2 AI
"Shrinking the giants: Paving the way for TinyAI" (2024) - Cell Press
"DeepSeek might not be such good news for energy after all" (2024) - MIT Technology Review
"AI already uses as much energy as a small country. It’s only the beginning." (2024) - Vox
"Quelle contribution du numérique à la décarbonation ?" (2024) - France Stratégie
"Les promesses de l’IA grevées par un lourd bilan carbone" (2024) - Le Monde
"How much electricity does AI consume?" (2024) - The Verge
"How do I track the direct environmental impact of my own inference and training when working with AI?" (2024) - Blog
"Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse?" (2024) - The Guardian
"Light bulbs have energy ratings — so why can’t AI chatbots?" (2024) - Nature
"The Environmental Impacts of AI -- Primer" (2024) - Hugging Face
"The Climate and Sustainability Implications of Generative AI" (2024) - MIT
"AI's "eye-watering" use of resources could be a hurdle to achieving climate goals, argue experts" (2023) - dezeen
"How coders can help save the planet?" (2023) - Blog
"Reducing the Carbon Footprint of Generative AI" (2023) - Blog
"The MPG of LLMs: Exploring the Energy Efficiency of Generative AI" (2023) - Blog
"Ecologie numérique: L’IA durable, entre vœu pieux et opportunité de marché" (2025) - Libération

Reports 📈

"The environmental impact of local text AI" (2025) - Green Spector
"Misinformation by Omission: The Need for More Environmental Transparency in AI" (2025) - None
"A General Framework for Frugal AI" (2025) - AFNOR
"The 2025 AI Index Report" (2025) - Stanford Human-centered Artificial Intelligence
"Energy and AI" (2025) - International Energy Agency
"Key challenges for the environmental performance of AI" (2025) - French Ministry
"Artificial Intelligence and electricity: A system dynamics approach" (2024) - Schneider
"Notable AI Models" (2025) - Epoch AI
"Powering Artificial Intelligence" (2024) - Deloitte
"Google Sustainability Reports" (2024) - Google
"How much water does AI consume? The public deserves to know" (2023) - OECD
"Measuring the environmental impacts of artificial intelligence compute and applications" (2022) - OECD

Research Articles 📄

Paper	Year	Venue
Auditing Prompt Caching in Language Model APIs	2025	ICML
Mind the Gap: A Practical Attack on GGUF Quantization	2025	ICML
Olica: Efficient Structured Pruning of Large Language Models without Retraining	2025	ICML
TESS 2: A Large-Scale Generalist Diffusion Language Model	2025	None
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference	2025	None
Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential	2025	None
Better & Faster Large Language Models via Multi-token Prediction	2024	None
The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries	2025	None
QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models	2025	None
Fast Video Generation with Sliding Tile Attention	2025	ICML
Quartet: Native FP4 Training Can Be Optimal for Large Language Models	2025	None
How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference	2025	None
MagCache: Fast Video Generation with Magnitude-Aware Cache	2025	None
Compressing Language Models for Specialized Domains	2025	None
Dynamic Chunking for End-to-End Hierarchical Sequence Modeling	2025	None
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training	2025	None
XAttention: Block Sparse Attention with Antidiagonal Scoring	2025	ICML
Jenga: Effective Memory Management for Serving LLM with Heterogeneity	2025	None
Learning Few-Step Diffusion Models by Trajectory Distribution Matching	2025	ICCV
Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation	2025	None
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding	2025	None
Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas	2025	None
Mirage: A Multi-Level Superoptimizer for Tensor Programs	2025	None
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization	2025	None
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse	2025	None
Hardware-Efficient Attention for Fast Decoding	2025	None
Model-Preserving Adaptive Rounding	2025	None
Frugal AI: Introduction, Concepts, Development and Open Questions	2025	None
Making AI Less “Thirsty”: Uncovering and Addressing the Secret Water Footprint of AI Models	2025	None
Efficient Time Series Processing for Transformers and State-Space Models through Token Merging	2025	None
A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency	2025	None
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference	2025	None
s1: Simple test-time scaling	2025	None
BitNet b1.58 2B4T Technical Report	2025	None
NdLinear Is All You Need for Representation Learning	2025	None
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation	2025	ICLR
FISH-Tuning: Enhancing PEFT Methods with Fisher Information	2025	None
Green Prompting	2025	None
Compression Scaling Laws: Unifying Sparsity and Quantization	2025	None
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality	2025	ICLR
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding	2025	ICLR
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models	2025	None
Real-Time Video Generation with Pyramid Attention Broadcast	2025	ICLR
Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models	2025	ICLR
Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing	2025	ICLR
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention	2025	None
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute	2025	None
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling	2025	None
SpinQuant: LLM Quantization with Learned Rotations	2025	ICLR
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps	2025	None
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations	2025	None
Distillation Scaling Laws	2025	None
From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI's Polarized Environmental Debate	2025	None
Coca4ai: checking energy behaviors on AI data centers	2024	None
Scaling up Masked Diffusion Models on Text	2024	None
LTX-Video: Realtime Video Latent Diffusion	2024	CVPR
How Green Can AI Be? A Study of Trends in Machine Learning Environmental Impacts	2024	None
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs	2024	NeurIPS
The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information	2024	NeurIPS
Palu: Compressing KV-Cache with Low-Rank Projection	2024	None
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration	2024	MLSys
LOFIT: Localized Fine-tuning on LLM Representations	2024	NeurIPS
Outlier Weighed Layerwise Sparsity: A Missing Secret Sauce for Pruning LLMs to High Sparsity	2024	ICML
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality	2024	None
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks	2024	ICML
QTIP: Quantization with Trellises and Incoherence Processing	2024	NeurIPS
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models	2024	EMNLP
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving	2024	None
Extreme Compression of Large Language Models via Additive Quantization	2024	ICML
Fast Matrix Multiplications for Lookup Table-Quantized LLMs	2024	None
GPTVQ: The Blessing of Dimensionality for LLM Quantization	2024	None
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey	2024	None
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration	2024	None
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices	2024	NeurIPS
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (https://arxiv.org/pdf/2403.03853)	2024	None
Canvas: End-to-End Kernel Architecture Search in Neural Networks	2024	None
Scaling Laws for Precision	2024	None
DeepCache: Accelerating Diffusion Models for Free	2024	CVPR
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding	2024	ACL
Power Hungry Processing: Watts Driving the Cost of AI Deployment?	2024	FAccT
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression	2024	ICML
Pushing the Limits of Large Language Model Quantization via the Linearity Theorem	2024	None
Position: Tensor Networks are a Valuable Asset for Green AI	2024	None
Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI	2024	None
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes	2024	ICLR
Efficient Memory Management for Large Language Model Serving with PagedAttention	2023	SOSP
Broken Neural Scaling Laws	2023	ICLR
Post Training Mixed Precision Quantization of Neural Networks using First-Order Information	2023	ICCV
Ring Attention with Blockwise Transformers for Near-Infinite Context	2023	None
A Practical Mixed Precision Algorithm for Post-Training Quantization	2023	None
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models	2023	ICML
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs	2023	None
Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning	2023	Sustainable Computing: Informatics and Systems
An experimental comparison of software-based power meters: focus on CPU and GPU	2023	CCGrid
Fast Inference from Transformers via Speculative Decoding	2023	ICML
Efficient Streaming Language Models with Attention Sinks	2023	ICLR
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers	2023	None
Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance	2022	ECCV
Knowledge Distillation: A Good Teacher is Patient and Consistent	2022	CVPR
LoRA: Low-Rank Adaptation of Large Language Models	2022	ICLR
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale	2022	NeurIPS
Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training	2022	ICML
Sustainable AI: Environmental Implications, Challenges and Opportunities	2022	None
Learnable Lookup Table for Neural Network Quantization	2022	CVPR
Training Compute-Optimal Large Language Models	2022	None
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness	2022	None
Towards a Unified View of Parameter-Efficient Transfer Learning	2022	ICLR
Parameter-Efficient Transfer Learning with Diff Pruning	2021	ACL
What is the State of Neural Network Pruning?	2020	MLSys
Scaling Laws for Autoregressive Generative Modeling	2020	None
Model Compression via Distillation and Quantization	2018	ICLR
Optimal Brain Damage	1989	NeurIPS

Blogs 📰

"Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B" (2025) - Hazy Research
"Our contribution to a global environmental standard for AI" (2025) - Mistral AI
"AI: It's All About Inference Now" (2025) - ACM Queue
"ScalarLM vLLM Optimization with Virtual Channels" (2025) - ScalarLM
"Review of Inference Optimization" (2025) - Aussie AI
"The Limits of Large Fused Kernels on Nvidia GPUs: Why Real-Time AI Inference Needs More" (2025) - Smallest AI
"How Much Power does a SOTA Open Video Model Use?" (2025) - Hugging Face
"Improving Quantized FP4 Weight Quality via Logit Distillation" (2025) - Mobius Labs
"Introducing NVFP4 for Efficient and Accurate Low-Precision Inference" (2025) - Nvidia
"The LLM Engineer Almanac" (2025) - Modal
"Enhance Your Models in 5 Minutes with the Hugging Face Kernel Hub" (2025) - Hugging Face
"Reduce, Reuse, Recycle: Why Open Source is a Win for Sustainability" (2025) - Hugging Face
"Mixture of Experts: When Does It Really Deliver Energy Efficiency?" (2025) - Neural Watt
"Efficient and Portable Mixture-of-Experts Communication" (2025) - Perplexity
"Optimizing Tokenization for Faster and Efficient LLM Processing" (2025) - Medium
"Tensor Parallelism with CUDA - Multi-GPU Matrix Multiplication" (2025) - Substack
"Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling" (2025) - Nvidia Developer
"AI CUDA Engineer" (2025) - Sakana AI
"The ML/AI Engineer's starter guide to GPU Programming" (2025) - Neural Bits
"Understanding Quantization for LLMs" (2024) - Medium
"Don't Merge Your LoRA Adapter Into a 4-bit LLM" (2023) - Substack
"Matrix Multiplication Background User's Guide" (2023) - Nvidia Developer
"GPU Performance Background User's Guide" (2023) - Nvidia Developer

Books 📚

Programming Massively Parallel Processors: A Hands-on Approach (2022), Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
Efficient Deep Learning (2022), Gaurav Menghani, Naresh Singh

Lectures 🎓

AI Efficiency Courses: Slides, Exercises (2025) - Lecture by Bertrand Charpentier
Data Compression, Theory and Applications: YouTube, Slides (2024) - Stanford
MIT Han's Lab (2024) - MIT Lecture by Han's lab
GPU Mode (2020) - Tutorials by GPU mode community

People 🧑‍💻

Name	Affiliation	Research Interests
James Martin	Better Tech	AI Sustainability
Saleh Ashkboos	ETH Zurich	Quantization
Dan Alistarh	IST Austria	AI Compression
Elias Frantar	OpenAI	Quantization
Tim Dettmers	CMU	Quantization
Song Han	MIT	AI Efficiency
Scott Chamberlin	TBD	AI Efficiency
Benoit Petit	Boavista	Data Center Efficiency
Samuel Rincé	Gen AI Impact	AI Efficiency, Sustainability
Théo Alves Da Costa	Ekimetrics	AI Efficiency, Sustainability
Sasha Luccioni	Hugging Face	AI Sustainability
Anne-Laure Ligozat	ENSIIE	AI Sustainability
Boris Gamazaychikov	Salesforce	AI Sustainability
Julie Ravillon	Salesforce	AI Sustainability
Will Alpine	Enabled Emissions Campaign	AI Sustainability
Holly Alpine	Enabled Emissions Campaign	AI Sustainability
Drew Wilkinson	Climate Leadership Collective	AI Sustainability
Maren Costa	WorkforClimate	AI Sustainability
Lou Welgryn	Data4Good	AI Ethics, Sustainability
Caroline Jean-Pierre	Gen AI Impact	AI Sustainability
Claire Saignol	Gen AI Impact	AI Sustainability
Juliette Fropier	French Ministry	AI Sustainability
Helene Costa de Beauregard	French Ministry	AI Sustainability
Rémy Marrone	Independent	AI Sustainability
Mark Butcher	Positive Cloud	Cloud Sustainability
Robert Keus	GreenPT	AI Sustainability
Cas Burggraaf	GreenPT	AI Sustainability
Wilco Burggraaf	GreenPT	AI Sustainability
Anna Lerner Nesbitt	Climate Collective	AI Sustainability
Scott Chamberlin	Neural Watt	AI Sustainability
Jeremy Tamanini	Dual Citizen LLC	AI Sustainability
Emma Strubell	CMU	AI Sustainability

Organizations 🌍

Organization	Description	Website
Data4Good	A platform that connects data scientists with social impact projects to address global challenges using data.	data4good.org
Gen AI Impact	A platform dedicated to understanding the environmental footprint of generative AI.	genai-impact.org
Make.org	A global platform that empowers citizens to propose and take action on social and environmental issues through collective projects.	make.org
CodeCarbon	A tool that helps track the carbon emissions of machine learning models and optimizes them for sustainability.	codecarbon.io
Sustainable AI Coalition	An organization dedicated to advancing sustainability in AI technologies and promoting best practices for green AI.	sustainableaicoalition.org
FruitPunch AI	A community that builds AI solutions for impact organizations that contribute to the SDGs.	fruitpunch.ai

Contributing 🤝

Contributions are welcome! Please follow our contribution guidelines to add new resources or suggest improvements that promote AI efficiency. You can contact @sharpenb if you have any questions.

License 📄

This project is licensed under the MIT License. Feel free to share and use the resources as needed.


Alternative AI tools for awesome-ai-efficiency

Similar Open Source Tools

awesome-ai-efficiency

github

: 115

Awesome_Multimodel_LLM

github

: 231

Open-dLLM

Open-dLLM is the most open release of a diffusion-based large language model, providing pretraining, evaluation, inference, and checkpoints. It introduces Open-dCoder, the code-generation variant of Open-dLLM. The repo offers a complete stack for diffusion LLMs, enabling users to go from raw data to training, checkpoints, evaluation, and inference in one place. It includes pretraining pipeline with open datasets, inference scripts for easy sampling and generation, evaluation suite with various metrics, weights and checkpoints on Hugging Face, and transparent configs for full reproducibility.

github

: 237

END-TO-END-GENERATIVE-AI-PROJECTS

The 'END TO END GENERATIVE AI PROJECTS' repository is a collection of awesome industry projects utilizing Large Language Models (LLM) for various tasks such as chat applications with PDFs, image to speech generation, video transcribing and summarizing, resume tracking, text to SQL conversion, invoice extraction, medical chatbot, financial stock analysis, and more. The projects showcase the deployment of LLM models like Google Gemini Pro, HuggingFace Models, OpenAI GPT, and technologies such as Langchain, Streamlit, LLaMA2, LLaMAindex, and more. The repository aims to provide end-to-end solutions for different AI applications.

github

: 145

FaceAISDK_Android

FaceAI SDK is an on-device offline face detection, recognition, liveness detection, anti-spoofing, and 1:N/M:N face search SDK. It enables quick integration to achieve on-device face recognition, face search, and other functions. The SDK performs all functions offline on the device without the need for internet connection, ensuring privacy and security. It supports various actions for liveness detection, custom camera management, and clear imaging even in challenging lighting conditions.

github

: 875

awesome-LangGraph

Awesome LangGraph is a curated list of projects, resources, and tools for building stateful, multi-actor applications with LangGraph. It provides valuable resources for developers at all stages of development, from beginners to those building production-ready systems. The repository covers core ecosystem components, LangChain ecosystem, LangGraph platform, official resources, starter templates, pre-built agents, example applications, development tools, community projects, AI assistants, content & media, knowledge & retrieval, finance & business, sustainability, learning resources, companies using LangGraph, contributing guidelines, and acknowledgments.

github

: 1.1k

HaE

HaE is a framework project in the field of network security (data security) that combines artificial intelligence (AI) large models to achieve highlighting and information extraction of HTTP messages (including WebSocket). It aims to reduce testing time, focus on valuable and meaningful messages, and improve vulnerability discovery efficiency. The project provides a clear and visual interface design, simple interface interaction, and centralized data panel for querying and extracting information. It also features built-in color upgrade algorithm, one-click export/import of data, and integration of AI large models API for optimized data processing.

github

: 2.7k

Nocode-Wep

Nocode/WEP is a forward-looking office visualization platform that includes modules for document building, web application creation, presentation design, and AI capabilities for office scenarios. It supports features such as configuring bullet comments, global article comments, multimedia content, custom drawing boards, flowchart editor, form designer, keyword annotations, article statistics, custom appreciation settings, JSON import/export, content block copying, and unlimited hierarchical directories. The platform is compatible with major browsers and aims to deliver content value, iterate products, share technology, and promote open-source collaboration.

github

: 143

happy-llm

Happy-LLM is a systematic learning tutorial for Large Language Models (LLM) that covers NLP research methods, LLM architecture, training process, and practical applications. It aims to help readers understand the principles and training processes of large language models. The tutorial delves into Transformer architecture, attention mechanisms, pre-training language models, building LLMs, training processes, and practical applications like RAG and Agent technologies. It is suitable for students, researchers, and LLM enthusiasts with programming experience, Python knowledge, and familiarity with deep learning and NLP concepts. The tutorial encourages hands-on practice and participation in LLM projects and competitions to deepen understanding and contribute to the open-source LLM community.

github

: 17.4k

Awesome-LLM-RAG-Application

Awesome-LLM-RAG-Application is a repository that provides resources and information about applications based on Large Language Models (LLM) with Retrieval-Augmented Generation (RAG) pattern. It includes a survey paper, GitHub repo, and guides on advanced RAG techniques. The repository covers various aspects of RAG, including academic papers, evaluation benchmarks, downstream tasks, tools, and technologies. It also explores different frameworks, preprocessing tools, routing mechanisms, evaluation frameworks, embeddings, security guardrails, prompting tools, SQL enhancements, LLM deployment, observability tools, and more. The repository aims to offer comprehensive knowledge on RAG for readers interested in exploring and implementing LLM-based systems and products.

github

: 1.5k

LlamaV-o1

LlamaV-o1 is a Large Multimodal Model designed for spontaneous reasoning tasks. It outperforms various existing models on multimodal reasoning benchmarks. The project includes a Step-by-Step Visual Reasoning Benchmark, a novel evaluation metric, and a combined Multi-Step Curriculum Learning and Beam Search Approach. The model achieves superior performance in complex multi-step visual reasoning tasks in terms of accuracy and efficiency.

github

: 215

Step-DPO

Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.

github

: 155

vlmrun-cookbook

VLM Run Cookbook is a repository containing practical examples and tutorials for extracting structured data from images, videos, and documents using Vision Language Models (VLMs). It offers comprehensive Colab notebooks demonstrating real-world applications of VLM Run, with complete code and documentation for easy adaptation. The examples cover various domains such as financial documents and TV news analysis.

github

: 259

AI0x0.com

AI 0x0 is a versatile AI query generation desktop floating assistant application that supports MacOS and Windows. It allows users to utilize AI capabilities in any desktop software to query and generate text, images, audio, and video data, helping them work more efficiently. The application features a dynamic desktop floating ball, floating dialogue bubbles, customizable presets, conversation bookmarking, preset packages, network acceleration, query mode, input mode, mouse navigation, deep customization of ChatGPT Next Web, support for full-format libraries, online search, voice broadcasting, voice recognition, voice assistant, application plugins, multi-model support, online text and image generation, image recognition, frosted glass interface, light and dark theme adaptation for each language model, and free access to all language models except Chat0x0 with a key.

github

: 3.5k

nndeploy

nndeploy is a tool that allows you to quickly build your visual AI workflow without the need for frontend technology. It provides ready-to-use algorithm nodes for non-AI programmers, including large language models, Stable Diffusion, object detection, image segmentation, etc. The workflow can be exported as a JSON configuration file, supporting Python/C++ API for direct loading and running, deployment on cloud servers, desktops, mobile devices, edge devices, and more. The framework includes mainstream high-performance inference engines and deep optimization strategies to help you transform your workflow into enterprise-level production applications.

GitHub stars: 1.2k

ipex-llm

The `ipex-llm` repository is an LLM acceleration library designed for Intel GPU, NPU, and CPU. It provides seamless integration with various models and tools like llama.cpp, Ollama, HuggingFace transformers, LangChain, LlamaIndex, vLLM, Text-Generation-WebUI, DeepSpeed-AutoTP, FastChat, Axolotl, and more. The library offers optimizations for over 70 models, XPU acceleration, and support for low-bit (FP8/FP6/FP4/INT4) operations. Users can run different models on Intel GPUs, NPU, and CPUs with support for various features like finetuning, inference, serving, and benchmarking.

GitHub stars: 7.6k

For similar tasks

awesome-ai-efficiency

GitHub stars: 115

Awesome-Resource-Efficient-LLM-Papers

A curated list of high-quality papers on resource-efficient Large Language Models (LLMs) with a focus on various aspects such as architecture design, pre-training, fine-tuning, inference, system design, and evaluation metrics. The repository covers topics like efficient transformer architectures, non-transformer architectures, memory efficiency, data efficiency, model compression, dynamic acceleration, deployment optimization, support infrastructure, and other related systems. It also provides detailed information on computation metrics, memory metrics, energy metrics, financial cost metrics, network communication metrics, and other metrics relevant to resource-efficient LLMs. The repository includes benchmarks for evaluating the efficiency of NLP models and references for further reading.

GitHub stars: 105

aimet

AIMET is a library that provides advanced model quantization and compression techniques for trained neural network models. Its features have been proven to improve the run-time performance of deep learning models while lowering compute and memory requirements, with minimal impact on task accuracy. AIMET is designed to work with PyTorch, TensorFlow and ONNX models. The project also hosts the AIMET Model Zoo, a collection of popular neural network models optimized for 8-bit inference, and provides recipes for quantizing floating-point models with AIMET.

GitHub stars: 2.5k

neural-compressor

Intel® Neural Compressor is an open-source Python library that supports popular model compression techniques such as quantization, pruning (sparsity), distillation, and neural architecture search on mainstream frameworks such as TensorFlow, PyTorch, ONNX Runtime, and MXNet. It provides key features, typical examples, and open collaborations, including support for a wide range of Intel hardware, validation of popular LLMs, and collaboration with cloud marketplaces, software platforms, and open AI ecosystems.

GitHub stars: 2.5k

Awesome-LLM-Prune

This repository is dedicated to the pruning of large language models (LLMs). It aims to serve as a comprehensive resource for researchers and practitioners interested in the efficient reduction of model size while maintaining or enhancing performance. The repository contains various papers, summaries, and links related to different pruning approaches for LLMs, along with author information and publication details. It covers a wide range of topics such as structured pruning, unstructured pruning, semi-structured pruning, and benchmarking methods. Researchers and practitioners can explore different pruning techniques, understand their implications, and access relevant resources for further study and implementation.

GitHub stars: 262

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

GitHub stars: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

GitHub stars: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

GitHub stars: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

GitHub stars: 668

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

GitHub stars: 27.7k

BricksLLM

BricksLLM is a cloud-native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Here are some use cases for BricksLLM:

* Set LLM usage limits for users on different pricing tiers
* Track LLM usage on a per-user and per-organization basis
* Block or redact requests containing PII
* Improve LLM reliability with failovers, retries and caching
* Distribute API keys with rate limits and cost limits for internal development/production use cases
* Distribute API keys with rate limits and cost limits for students

GitHub stars: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

GitHub stars: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

GitHub stars: 2.2k
