
Awesome-Resource-Efficient-LLM-Papers
a curated list of high-quality papers on resource-efficient LLMs 🌱
Stars: 104

A curated list of high-quality papers on resource-efficient Large Language Models (LLMs) with a focus on various aspects such as architecture design, pre-training, fine-tuning, inference, system design, and evaluation metrics. The repository covers topics like efficient transformer architectures, non-transformer architectures, memory efficiency, data efficiency, model compression, dynamic acceleration, deployment optimization, support infrastructure, and other related systems. It also provides detailed information on computation metrics, memory metrics, energy metrics, financial cost metrics, network communication metrics, and other metrics relevant to resource-efficient LLMs. The repository includes benchmarks for evaluating the efficiency of NLP models and references for further reading.
README:
This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models. If you find our repo helpful, please consider starring ⭐ it—we’d really appreciate your support!
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Mixed Precision Training | FP8-LM: Training FP8 Large Language Models | arXiv |
2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |
2018 | Mixed Precision Training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | ACL |
2017 | Mixed Precision Training | Mixed Precision Training | ICLR |
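The mixed-precision entries above all revolve around the same idea: run most of the forward/backward pass in reduced precision while scaling the loss to avoid underflow. As a rough, generic illustration (not the recipe of any specific paper listed), a PyTorch automatic-mixed-precision training step might look like the sketch below; the model, data, and hyperparameters are placeholders.

```python
# Minimal mixed-precision training loop (PyTorch AMP); illustrative only.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # AMP matters on GPU; falls back to FP32 on CPU

model = nn.Linear(512, 10).to(device)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):                          # placeholder training loop
    x = torch.randn(32, 512, device=device)    # fake batch
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=use_amp):   # forward pass in reduced precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()               # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                      # unscale grads, then optimizer step
    scaler.update()                             # adjust the loss scale
```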
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Data Augmentation | LLMRec: Large Language Models with Graph Augmentation for Recommendation | WSDM |
2024 | Data Augmentation | LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | arXiv |
2023 | Data Augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
2023 | Data Augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
2023 | Data Augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
2023 | Data Augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |
2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
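The masked-modeling papers above share one core mechanism: hide a fraction of the input and train the model to reconstruct it. The sketch below shows only that masking step in toy form; the mask rate, token IDs, and the `-100` ignore-label convention are assumptions for illustration, not details taken from any listed paper.

```python
# Sketch: random masking for masked-language-model pretraining; toy values only.
import random

MASK_ID = 103          # assumed [MASK] token id
MASK_RATE = 0.15       # assumed fraction of tokens to mask

def mask_tokens(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (masked input, labels); labels are -100 where no prediction is needed."""
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < MASK_RATE:
            inputs.append(MASK_ID)
            labels.append(tid)      # model must reconstruct the original token
        else:
            inputs.append(tid)
            labels.append(-100)     # ignored by the loss
    return inputs, labels

example = [2023, 2003, 1037, 7099, 6251, 1012]   # made-up token ids
print(mask_tokens(example))
```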
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Hardware optimization | LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System | arXiv |
2024 | Hardware optimization | LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration | arXiv |
2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
2023 | Hardware offloading | Fast distributed inference serving for large language models | arXiv |
2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |
2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |
Metric | Description | Example Usage |
---|---|---|
FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days] [hours] |
Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds | [end-to-end latency in seconds] [next token generation latency in milliseconds] |
Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s] [queries/s] |
Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up] [throughput speed-up] |
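To make the latency, throughput, and speed-up entries above concrete, here is a minimal timing sketch in Python; `generate` is a hypothetical stand-in for a real model call, and the baseline latency is an assumed number.

```python
# Sketch: measuring end-to-end latency and token throughput for a generation call.
import time

def generate(prompt: str, max_new_tokens: int = 128) -> list[str]:
    """Placeholder for a real model call; here it just fabricates tokens."""
    time.sleep(0.05)  # pretend the model is working
    return ["tok"] * max_new_tokens

prompt = "Explain resource-efficient LLMs in one sentence."
start = time.perf_counter()
tokens = generate(prompt, max_new_tokens=128)
latency_s = time.perf_counter() - start          # end-to-end latency in seconds

throughput_tps = len(tokens) / latency_s         # tokens per second
print(f"latency: {latency_s:.3f} s, throughput: {throughput_tps:.1f} tokens/s")

# Speed-up ratio against a baseline measurement (assumed number):
baseline_latency_s = 0.40
print(f"speed-up vs. baseline: {baseline_latency_s / latency_s:.2f}x")
```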
Metric | Description | Example Usage |
---|---|---|
Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
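A quick way to obtain the two memory metrics above for a PyTorch model is to count parameters and multiply by bytes per weight; the toy model and FP16 assumption below are illustrative only and ignore activations, optimizer state, and the KV cache.

```python
# Sketch: number of parameters and approximate model size for a toy PyTorch model.
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))  # placeholder

num_params = sum(p.numel() for p in model.parameters())
bytes_per_param = 2                      # assuming FP16/BF16 weights; use 4 for FP32
size_gb = num_params * bytes_per_param / 1024**3

print(f"parameters: {num_params:,}")
print(f"approx. model size: {size_gb:.4f} GB (weights only)")
```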
Metric | Description | Example Usage |
---|---|---|
Energy Consumption | the electrical power used during the LLM’s lifecycle | [kWh] |
Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
The following are available software packages designed for real-time tracking of energy consumption and carbon emissions. You might also find the following helpful for predicting the energy usage and carbon footprint before actual training or inference.
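One such tracking package, for example, is CodeCarbon; a minimal sketch of wrapping a workload with its `EmissionsTracker` might look like this (assuming `codecarbon` is installed; the workload itself is a placeholder).

```python
# Sketch: tracking energy/carbon with CodeCarbon (pip install codecarbon); illustrative only.
from codecarbon import EmissionsTracker

def training_step():
    # Placeholder for a real training or inference workload.
    return sum(i * i for i in range(1_000_000))

tracker = EmissionsTracker(project_name="resource-efficient-llm-demo")
tracker.start()
try:
    for _ in range(5):
        training_step()
finally:
    emissions_kg = tracker.stop()        # estimated emissions in kg CO2eq

print(f"estimated emissions: {emissions_kg:.6f} kgCO2eq")
```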
Metric | Description | Example Usage |
---|---|---|
Dollars per parameter | the total cost of training (or running) the LLM divided by its number of parameters |
Metric | Description | Example Usage |
---|---|---|
Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
Metric | Description | Example Usage |
---|---|---|
Compression Ratio | the reduction in the size of the compressed model compared to the original model | [compress rate] [percentage of weights remaining] |
Loyalty/Fidelity | the resemblance between the teacher and student models in terms of both prediction consistency and predicted probability distribution alignment | [loyalty] [fidelity] |
Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)] [Pareto frontier (performance and FLOPs)] |
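As a worked example of the compression-ratio and weights-remaining entries above, the sketch below applies naive magnitude pruning to a toy weight matrix; the matrix size and 90% pruning level are arbitrary assumptions, not values from any listed paper.

```python
# Sketch: compression ratio and percentage of weights remaining after magnitude pruning.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(1024, 1024)).astype(np.float32)   # toy dense weight matrix

threshold = np.quantile(np.abs(weights), 0.90)   # prune the smallest 90% by magnitude
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

total = weights.size
remaining = int(np.count_nonzero(pruned))

weights_remaining_pct = 100.0 * remaining / total
compression_ratio = total / remaining            # original size / compressed size (nonzeros only)

print(f"weights remaining: {weights_remaining_pct:.1f}%")
print(f"compression ratio: {compression_ratio:.1f}x (ignoring sparse-index overhead)")
```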
Benchmark | Description | Paper |
---|---|---|
General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
SustaiNLP 2020 Shared Task | a challenge for development of energy-efficient NLP models by assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |
If you find this paper list useful in your research, please consider citing:
@article{bai2024beyond,
title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
journal={arXiv preprint arXiv:2401.00625},
year={2024}
}
Alternative AI tools for Awesome-Resource-Efficient-LLM-Papers
Similar Open Source Tools


AudioLLM
AudioLLMs is a curated collection of research papers focusing on developing, implementing, and evaluating language models for audio data. The repository aims to provide researchers and practitioners with a comprehensive resource to explore the latest advancements in AudioLLMs. It includes models for speech interaction, speech recognition, speech translation, audio generation, and more. Additionally, it covers methodologies like multitask audioLLMs and segment-level Q-Former, as well as evaluation benchmarks like AudioBench and AIR-Bench. Adversarial attacks such as VoiceJailbreak are also discussed.

Awesome-Model-Merging-Methods-Theories-Applications
A comprehensive repository focusing on 'Model Merging in LLMs, MLLMs, and Beyond', providing an exhaustive overview of model merging methods, theories, applications, and future research directions. The repository covers various advanced methods, applications in foundation models, different machine learning subfields, and tasks like pre-merging methods, architecture transformation, weight alignment, basic merging methods, and more.

Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of papers on LLM inference and serving.

speech-trident
Speech Trident is a repository focusing on speech/audio large language models, covering representation learning, neural codec, and language models. It explores speech representation models, speech neural codec models, and speech large language models. The repository includes contributions from various researchers and provides a comprehensive list of speech/audio language models, representation models, and codec models.

ai-game-development-tools
Here we will keep track of AI game development tools, covering LLM, Agent, Code, Framework, Writer, Image, Texture, Shader, 3D Model, Avatar, Animation, Video, Audio, Music, Singing Voice, Speech, and Analytics.

Github-Ranking-AI
This repository provides a list of the most starred and forked repositories on GitHub. It is updated automatically and includes information such as the project name, number of stars, number of forks, language, number of open issues, description, and last commit date. The repository is divided into two sections: LLM and ChatGPT. The LLM section includes repositories related to large language models, while the ChatGPT section includes repositories related to the ChatGPT chatbot.

LLM4Opt
LLM4Opt is a collection of references and papers focusing on applying Large Language Models (LLMs) for diverse optimization tasks. The repository includes research papers, tutorials, workshops, competitions, and related collections related to LLMs in optimization. It covers a wide range of topics such as algorithm search, code generation, machine learning, science, industry, and more. The goal is to provide a comprehensive resource for researchers and practitioners interested in leveraging LLMs for optimization tasks.

open-llms
Open LLMs is a repository containing various Large Language Models licensed for commercial use. It includes models like T5, GPT-NeoX, UL2, Bloom, Cerebras-GPT, Pythia, Dolly, and more. These models are designed for tasks such as transfer learning, language understanding, chatbot development, code generation, and more. The repository provides information on release dates, checkpoints, papers/blogs, parameters, context length, and licenses for each model. Contributions to the repository are welcome, and it serves as a resource for exploring the capabilities of different language models.

Cool-GenAI-Fashion-Papers
Cool-GenAI-Fashion-Papers is a curated list of resources related to GenAI-Fashion, including papers, workshops, companies, and products. It covers a wide range of topics such as fashion design synthesis, outfit recommendation, fashion knowledge extraction, trend analysis, and more. The repository provides valuable insights and resources for researchers, industry professionals, and enthusiasts interested in the intersection of AI and fashion.

ai-reference-models
The Intel® AI Reference Models repository contains links to pre-trained models, sample scripts, best practices, and tutorials for popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors and Intel® Data Center GPUs. The purpose is to quickly replicate complete software environments showcasing the AI capabilities of Intel platforms. It includes optimizations for popular deep learning frameworks like TensorFlow and PyTorch, with additional plugins/extensions for improved performance. The repository is licensed under Apache License Version 2.0.

models
The Intel® AI Reference Models repository contains links to pre-trained models, sample scripts, best practices, and tutorials for popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors and Intel® Data Center GPUs. It aims to replicate the best-known performance of target model/dataset combinations in optimally-configured hardware environments. The repository will be deprecated upon the publication of v3.2.0 and will no longer be maintained or published.

awesome-llm-planning-reasoning
The 'Awesome LLMs Planning Reasoning' repository is a curated collection focusing on exploring the capabilities of Large Language Models (LLMs) in planning and reasoning tasks. It includes research papers, code repositories, and benchmarks that delve into innovative techniques, reasoning limitations, and standardized evaluations related to LLMs' performance in complex cognitive tasks. The repository serves as a comprehensive resource for researchers, developers, and enthusiasts interested in understanding the advancements and challenges in leveraging LLMs for planning and reasoning in real-world scenarios.
For similar tasks


llm-finetuning
llm-finetuning is a repository that provides a serverless twist to the popular axolotl fine-tuning library using Modal's serverless infrastructure. It allows users to quickly fine-tune any LLM model with state-of-the-art optimizations like Deepspeed ZeRO, LoRA adapters, Flash attention, and Gradient checkpointing. The repository simplifies the fine-tuning process by not exposing all CLI arguments, instead allowing users to specify options in a config file. It supports efficient training and scaling across multiple GPUs, making it suitable for production-ready fine-tuning jobs.

HighPerfLLMs2024
High Performance LLMs 2024 is a comprehensive course focused on building a high-performance Large Language Model (LLM) from scratch using Jax. The course covers various aspects such as training, inference, roofline analysis, compilation, sharding, profiling, and optimization techniques. Participants will gain a deep understanding of Jax and learn how to design high-performance computing systems that operate close to their physical limits.

LLM-Travel
LLM-Travel is a repository dedicated to exploring the mysteries of Large Language Models (LLM). It provides in-depth technical explanations, practical code implementations, and a platform for discussions and questions related to LLM. Join the journey to explore the fascinating world of large language models with LLM-Travel.

llm-inference-solutions
A collection of available inference solutions for Large Language Models (LLMs) including high-throughput engines, optimization libraries, deployment toolkits, and deep learning frameworks for production environments.

Fira
Fira is a memory-efficient training framework for Large Language Models (LLMs) that enables full-rank training under low-rank constraint. It introduces a method for training with full-rank gradients of full-rank weights, achieved with just two lines of equations. The framework includes pre-training and fine-tuning functionalities, packaged as a Python library for easy use. Fira utilizes Adam optimizer by default and provides options for weight decay. It supports pre-training LLaMA models on the C4 dataset and fine-tuning LLaMA-7B models on commonsense reasoning tasks.

xlstm-jax
The xLSTM-jax repository contains code for training and evaluating the xLSTM model on language modeling using JAX. xLSTM is a Recurrent Neural Network architecture that improves upon the original LSTM through Exponential Gating, normalization, stabilization techniques, and a Matrix Memory. It is optimized for large-scale distributed systems with performant triton kernels for faster training and inference.

litdata
LitData is a tool designed for blazingly fast, distributed streaming of training data from any cloud storage. It allows users to transform and optimize data in cloud storage environments efficiently and intuitively, supporting various data types like images, text, video, audio, geo-spatial, and multimodal data. LitData integrates smoothly with frameworks such as LitGPT and PyTorch, enabling seamless streaming of data to multiple machines. Key features include multi-GPU/multi-node support, easy data mixing, pause & resume functionality, support for profiling, memory footprint reduction, cache size configuration, and on-prem optimizations. The tool also provides benchmarks for measuring streaming speed and conversion efficiency, along with runnable templates for different data types. LitData enables infinite cloud data processing by utilizing the Lightning.ai platform to scale data processing with optimized machines.
For similar jobs

Awesome_Mamba
Awesome Mamba is a curated collection of groundbreaking research papers and articles on Mamba Architecture, a pioneering framework in deep learning known for its selective state spaces and efficiency in processing complex data structures. The repository offers a comprehensive exploration of Mamba architecture through categorized research papers covering various domains like visual recognition, speech processing, remote sensing, video processing, activity recognition, image enhancement, medical imaging, reinforcement learning, natural language processing, 3D recognition, multi-modal understanding, time series analysis, graph neural networks, point cloud analysis, and tabular data handling.

unilm
The 'unilm' repository is a collection of tools, models, and architectures for Foundation Models and General AI, focusing on tasks such as NLP, MT, Speech, Document AI, and Multimodal AI. It includes various pre-trained models, such as UniLM, InfoXLM, DeltaLM, MiniLM, AdaLM, BEiT, LayoutLM, WavLM, VALL-E, and more, designed for tasks like language understanding, generation, translation, vision, speech, and multimodal processing. The repository also features toolkits like s2s-ft for sequence-to-sequence fine-tuning and Aggressive Decoding for efficient sequence-to-sequence decoding. Additionally, it offers applications like TrOCR for OCR, LayoutReader for reading order detection, and XLM-T for multilingual NMT.

llm-app-stack
LLM App Stack, also known as Emerging Architectures for LLM Applications, is a comprehensive list of available tools, projects, and vendors at each layer of the LLM app stack. It covers various categories such as Data Pipelines, Embedding Models, Vector Databases, Playgrounds, Orchestrators, APIs/Plugins, LLM Caches, Logging/Monitoring/Eval, Validators, LLM APIs (proprietary and open source), App Hosting Platforms, Cloud Providers, and Opinionated Clouds. The repository aims to provide a detailed overview of tools and projects for building, deploying, and maintaining enterprise data solutions, AI models, and applications.

awesome-deeplogic
Awesome deep logic is a curated list of papers and resources focusing on integrating symbolic logic into deep neural networks. It includes surveys, tutorials, and research papers that explore the intersection of logic and deep learning. The repository aims to provide valuable insights and knowledge on how logic can be used to enhance reasoning, knowledge regularization, weak supervision, and explainability in neural networks.

Awesome-LLMs-on-device
Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.

duo-attention
DuoAttention is a framework designed to optimize long-context large language models (LLMs) by reducing memory and latency during inference without compromising their long-context abilities. It introduces a concept of Retrieval Heads and Streaming Heads to efficiently manage attention across tokens. By applying a full Key and Value (KV) cache to retrieval heads and a lightweight, constant-length KV cache to streaming heads, DuoAttention achieves significant reductions in memory usage and decoding time for LLMs. The framework uses an optimization-based algorithm with synthetic data to accurately identify retrieval heads, enabling efficient inference with minimal accuracy loss compared to full attention. DuoAttention also supports quantization techniques for further memory optimization, allowing for decoding of up to 3.3 million tokens on a single GPU.

llm_note
LLM notes repository contains detailed analysis on transformer models, language model compression, inference and deployment, high-performance computing, and system optimization methods. It includes discussions on various algorithms, frameworks, and performance analysis related to large language models and high-performance computing. The repository serves as a comprehensive resource for understanding and optimizing language models and computing systems.
