Efficient_Foundation_Model_Survey
Survey Paper List - Efficient LLM and Foundation Models
This repo contains the paper list and figures for A Survey of Resource-efficient LLM and Multimodal Foundation Models.
Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.
The scope of this survey is mainly defined by the following aspects.
- We survey only algorithm and system innovations; we exclude the large body of work on hardware design, which is beyond our expertise.
- The definition of "resource" in this survey is limited mainly to physical resources, including compute, memory, storage, and bandwidth; we exclude training data (labels) and privacy, which can also be regarded as resources.
- We mainly survey papers published at top-tier CS conferences, i.e., those included in CSRankings. We also manually pick related and potentially high-impact papers from arXiv.
- We mainly survey papers published after 2020, since AI research moves fast and older knowledge and methods are frequently overturned.
@article{xu2024a,
  title   = {A Survey of Resource-efficient LLM and Multimodal Foundation Models},
  author  = {Xu, Mengwei and Yin, Wangsong and Cai, Dongqi and Yi, Rongjie
             and Xu, Daliang and Wang, Qipeng and Wu, Bingyang and Zhao, Yihao and Yang, Chen
             and Wang, Shihe and Zhang, Qiyang and Lu, Zhenyan and Zhang, Li and Wang, Shangguang
             and Li, Yuanchun and Liu, Yunxin and Jin, Xin and Liu, Xuanzhe},
  journal = {arXiv preprint arXiv:2401.08092},
  year    = {2024}
}
If we have left out any important papers, please let us know via GitHub Issues and we will include them in the next version.
We will actively maintain the survey and the GitHub repo.
- Foundation Model Overview
- Resource-efficient Architectures
- Resource-efficient Algorithms
- Resource-efficient Systems
- Attention is all you need. [arXiv'17] [Paper] [Code]
- Bert: Pre-training of deep bidirectional transformers for language understanding. [arXiv'18] [Paper] [Code]
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. [arXiv'19] [Paper] [Code]
- Roberta: A robustly optimized bert pretraining approach. [arXiv'19] [Paper] [Code]
- Sentence-bert: Sentence embeddings using siamese bert-networks. [EMNLP'19] [Paper] [Code]
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. [ACL'19] [Paper] [Code]
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. [arXiv'19] [Paper] [Code]
- Improving language understanding by generative pre-training. [URL]
- Language Models are Unsupervised Multitask Learners. [URL]
- Language Models are Few-Shot Learners. [NeurIPS'20] [Paper] [Code]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. [arXiv'21] [Paper] [Code]
- Palm: Scaling language modeling with pathways. [JMLR'22] [Paper] [Code]
- Training language models to follow instructions with human feedback. [NeurIPS'22] [Paper]
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. [JMLR'22] [Paper]
- Glam: Efficient scaling of language models with mixture-of-experts. [ICML'22] [Paper]
- wav2vec 2.0: A framework for self-supervised learning of speech representations. [NeurIPS'20] [Paper] [Code]
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. [TASLP'21] [Paper] [Code]
- Robust Speech Recognition via Large-Scale Weak Supervision. [ICML'23] [Paper]
- GPT-4 Technical Report. [arXiv'23] [Paper]
- Palm 2 technical report. [URL]
- Llama 2: Open foundation and fine-tuned chat models. [arXiv'23] [Paper] [Code]
- End-to-End Object Detection with Transformers. [ECCV'20] [Paper] [Code]
- Generative Pretraining from Pixels. [ICML'20] [Paper] [Code]
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. [ICLR'21] [Paper] [Code]
- Training data-efficient image transformers & distillation through attention. [ICML'21] [Paper] [Code]
- SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. [NeurIPS'21] [Paper] [Code]
- You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. [NeurIPS'21] [Paper] [Code]
- Swin Transformer V2: Scaling Up Capacity and Resolution. [CVPR'22] [Paper] [Code]
- Masked Autoencoders Are Scalable Vision Learners. [CVPR'22] [Paper] [Code]
- Exploring Plain Vision Transformer Backbones for Object Detection. [ECCV'22] [Paper] [Code]
- BEiT: BERT Pre-Training of Image Transformers. [ICLR'22] [Paper] [Code]
- DINOv2: Learning Robust Visual Features without Supervision. [arXiv'23] [Paper]
- Sequential Modeling Enables Scalable Learning for Large Vision Models. [arXiv'23] [Paper] [Code]
- Learning transferable visual models from natural language supervision. [ICML'21] [Paper] [Code]
- Align before fuse: Vision and language representation learning with momentum distillation. [NeurIPS'21] [Paper] [Code]
- Scaling up visual and vision-language representation learning with noisy text supervision. [ICML'21] [Paper]
- Imagebind: One embedding space to bind them all. [CVPR'23] [Paper] [Code]
- Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. [arXiv'23] [Paper] [Code]
- Pandagpt: One model to instruction-follow them all. [arXiv'23] [Paper] [Code]
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. [arXiv'23] [Paper] [Code]
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. [arXiv'23] [Paper] [Code]
- mplug-owl: Modularization empowers large language models with multi-modality. [arXiv'23] [Paper] [Code]
- Visual instruction tuning. [arXiv'23] [Paper] [Code]
- Flamingo: a visual language model for few-shot learning. [NeurIPS'22] [Paper]
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. [arXiv'23] [Paper] [Code]
- Palm-e: An embodied multimodal language model. [arXiv'23] [Paper] [Code]
- Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. [arXiv'23] [Paper] [Code]
- Any-to-any generation via composable diffusion. [arXiv'23] [Paper] [Code]
- Next-gpt: Any-to-any multimodal llm. [arXiv'23] [Paper] [Code]
- Uniter: Universal image-text representation learning. [ECCV'20] [Paper] [Code]
- Flava: A foundational language and vision alignment model. [CVPR'22] [Paper] [Code]
- Coca: Contrastive captioners are image-text foundation models. [arXiv'22] [Paper]
- Grounded language-image pre-training. [CVPR'22] [Paper] [Code]
- Segment anything. [arXiv'23] [Paper] [Code]
- Gemini: A Family of Highly Capable Multimodal Models. [arXiv'23] [Paper]
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. [arXiv'23] [Paper] [Code]
- Auto-encoding variational bayes. [arXiv'13] [Paper]
- Neural discrete representation learning. [NeurIPS'17] [Paper] [Code]
- Denoising Diffusion Probabilistic Models. [NeurIPS'20] [Paper] [Code]
- Denoising diffusion implicit models. [ICLR'21] [Paper] [Code]
- U-Net: Convolutional Networks for Biomedical Image Segmentation. [MICCAI'15] [Paper] [Code]
- High-Resolution Image Synthesis with Latent Diffusion Models. [CVPR'22] [Paper] [Code]
- Consistency models. [arXiv'23] [Paper] [Code]
- Zero-shot text-to-image generation. [ICML'21] [Paper] [Code]
- Any-to-Any Generation via Composable Diffusion. [arXiv'23] [Paper] [Code]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. [arXiv'24] [Paper] [Code]
- SAM 2: Segment Anything in Images and Videos. [arXiv'24] [Paper] [Code]
- Mobile Foundation Model as Firmware. [MobiCom'24] [Paper] [Code]
- Longformer: The long-document transformer. [arXiv'20] [Paper] [Code]
- ETC: Encoding Long and Structured Inputs in Transformers. [ACL'20] [Paper] [Code]
- Big bird: Transformers for longer sequences. [NeurIPS'20] [Paper] [Code]
- Efficient Attentions for Long Document Summarization. [NAACL'21] [Paper] [Code]
- MATE: Multi-view Attention for Table Transformer Efficiency. [EMNLP'21] [Paper] [Code]
- LittleBird: Efficient Faster & Longer Transformer for Question Answering. [arXiv'23] [Paper] [Code]
- Albert: A lite bert for self-supervised learning of language representations. [arXiv'19] [Paper] [Code]
- An efficient encoder-decoder architecture with top-down attention for speech separation. [ICLR'23] [Paper] [Code]
- Reformer: The Efficient Transformer. [ICLR'20] [Paper] [Code]
- Transformers are rnns: Fast autoregressive transformers with linear attention. [ICML'20] [Paper] [Code]
- Linformer: Self-Attention with Linear Complexity. [arXiv'20] [Paper] [Code]
- Luna: Linear unified nested attention. [NeurIPS'21] [Paper] [Code]
- Rethinking Attention with Performers. [arXiv'20] [Paper] [Code]
- PolySketchFormer: Fast Transformers via Sketches for Polynomial Kernels. [arXiv'23] [Paper]
- Mega: Moving Average Equipped Gated Attention. [ICLR'23] [Paper] [Code]
- Vision Transformer with Deformable Attention. [arXiv'22] [Paper] [Code]
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. [arXiv'21] [Paper] [Code]
- An attention free transformer. [arXiv'21] [Paper] [Code]
- Hyena hierarchy: Towards larger convolutional language models. [arXiv'23] [Paper]
- Perceiver: General perception with iterative attention. [ICML'21] [Paper] [Code]
- Scaling transformer to 1m tokens and beyond with rmt. [arXiv'23] [Paper]
- Recurrent memory transformer. [NeurIPS'22] [Paper] [Code]
- RWKV: Reinventing RNNs for the Transformer Era. [arXiv'23] [Paper] [Code]
- Retentive Network: A Successor to Transformer for Large Language Models. [arXiv'23] [Paper] [Code]
- Efficiently modeling long sequences with structured state spaces. [ICLR'22] [Paper] [Code]
- Hungry hungry hippos: Towards language modeling with state space models. [ICLR'23] [Paper] [Code]
- Resurrecting recurrent neural networks for long sequences. [arXiv'23] [Paper] [Code]
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces. [arXiv'23] [Paper] [Code]
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. [JMLR'22] [Paper] [Code]
- Scaling vision with sparse mixture of experts. [NeurIPS'21] [Paper] [Code]
- Glam: Efficient scaling of language models with mixture-of-experts. [ICML'22] [Paper] [Code]
- Multimodal contrastive learning with limoe: the language-image mixture of experts. [NeurIPS'22] [Paper] [Code]
- Mistral 7B. [arXiv'23] [Paper] [Code]
- Fast Feedforward Networks. [arXiv'23] [Paper] [Code]
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts. [ACL'22] [Paper] [Code]
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints. [arXiv'23] [Paper] [Code]
- Simplifying Transformer Blocks. [arXiv'23] [Paper] [Code]
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. [arXiv'23] [Paper] [Code]
- Bert loses patience: Fast and robust inference with early exit. [NeurIPS'20] [Paper] [Code]
- DeeBERT: Dynamic early exiting for accelerating BERT inference. [arXiv'20] [Paper] [Code]
- LGViT: Dynamic Early Exiting for Accelerating Vision Transformer. [MM'23] [Paper] [Code]
- Multi-Exit Vision Transformer for Dynamic Inference. [arXiv'21] [Paper]
- Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. [arXiv'23] [Paper]
- Improved denoising diffusion probabilistic models. [arXiv'21] [Paper] [Code]
- Accelerating diffusion models via early stop of the diffusion process. [arXiv'22] [Paper] [Code]
- Denoising diffusion implicit models. [ICLR'21] [Paper] [Code]
- gDDIM: Generalized denoising diffusion implicit models. [arXiv'22] [Paper] [Code]
- Pseudo numerical methods for diffusion models on manifolds. [arXiv'22] [Paper] [Code]
- Elucidating the design space of diffusion-based generative models. [arXiv'22] [Paper] [Code]
- Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. [NeurIPS'22] [Paper] [Code]
- Progressive distillation for fast sampling of diffusion models. [arXiv'22] [Paper] [Code]
- Fast sampling of diffusion models with exponential integrator. [arXiv'22] [Paper] [Code]
- Score-based generative modeling through stochastic differential equations. [arXiv'20] [Paper] [Code]
- Learning fast samplers for diffusion models by differentiating through sample quality. [arXiv'22] [Paper]
- Redi: efficient learning-free diffusion inference via trajectory retrieval. [ICML'23] [Paper] [Code]
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models. [NSDI'24] [Paper]
- Salad: Part-level latent diffusion for 3d shape generation and manipulation. [ICCV'23] [Paper] [Code]
- Binary Latent Diffusion. [CVPR'23] [Paper] [Code]
- LD-ZNet: A latent diffusion approach for text-based image segmentation. [ICCV'23] [Paper] [Code]
- Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. [CVPR'23] [Paper] [Code]
- High-resolution image reconstruction with latent diffusion models from human brain activity. [CVPR'23] [Paper] [Code]
- Belfusion: Latent diffusion for behavior-driven human motion prediction. [ICCV'23] [Paper] [Code]
- Unified multi-modal latent diffusion for joint subject and text conditional image generation. [arXiv'23] [Paper]
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. [arXiv'23] [Paper] [Code]
- ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. [CVPR'23] [Paper] [Code]
- eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. [arXiv'23] [Paper]
- ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models. [arXiv'23] [Paper] [Code]
- Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach. [arXiv'23] [Paper] [Code]
- LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference. [ICCV'21] [Paper] [Code]
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. [ICLR'22] [Paper] [Code]
- EfficientFormer: Vision Transformers at MobileNet Speed. [NeurIPS'22] [Paper] [Code]
- EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. [CVPR'23] [Paper] [Code]
- MetaFormer Is Actually What You Need for Vision. [CVPR'22] [Paper] [Code]
- Deduplicating Training Data Makes Language Models Better. [ACL'22] [Paper] [Code]
- TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection. [EMNLP'22] [Paper]
- Masked autoencoders are scalable vision learners. [CVPR'22] [Paper] [Code]
- MixMAE: Mixed and masked autoencoder for efficient pretraining of hierarchical vision transformers. [CVPR'23] [Paper] [Code]
- COPA: Efficient Vision-Language Pre-training through Collaborative Object- and Patch-Text Alignment. [MM'23] [Paper]
- Patchdropout: Economizing vision transformers using patch dropout. [WACV'23] [Paper] [Code]
- Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers. [CVPR'23] [Paper] [Code]
- Zero-Cost Proxies for Lightweight NAS. [ICLR'21] [Paper] [Code]
- ZiCo: Zero-shot NAS via inverse Coefficient of Variation on Gradients. [ICLR'23] [Paper] [Code]
- PASHA: Efficient HPO and NAS with Progressive Resource Allocation. [ICLR'23] [Paper] [Code]
- RankNAS: Efficient Neural Architecture Search by Pairwise Ranking. [EMNLP'21] [Paper]
- PreNAS: Preferred One-Shot Learning Towards Efficient Neural Architecture Search. [ICML'23] [Paper] [Code]
- ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices. [ICCV'23] [Paper] [Code]
- Efficient training of BERT by progressively stacking. [ICML'19] [Paper] [Code]
- On the Transformer Growth for Progressive BERT Training. [NAACL'21] [Paper] [Code]
- Staged training for transformer language models. [ICML'22] [Paper] [Code]
- Knowledge Inheritance for Pre-trained Language Models. [NAACL'22] [Paper] [Code]
- Learning to Grow Pretrained Models for Efficient Transformer Training. [ICLR'23] [Paper] [Code]
- Mesa: A memory-saving training framework for transformers. [arXiv'21] [Paper] [Code]
- GACT: Activation compressed training for generic network architectures. [ICML'22] [Paper] [Code]
- Memory efficient continual learning with transformers. [NeurIPS'22] [Paper]
- Metatroll: Few-shot detection of state-sponsored trolls with transformer adapters. [WWW'23] [Paper] [Code]
- St-adapter: Parameter-efficient image-to-video transfer learning. [NeurIPS'22] [Paper] [Code]
- Parameter-efficient fine-tuning without introducing new latency. [arXiv'23] [Paper] [Code]
- Adamix: Mixture-of-adaptations for parameter-efficient model tuning. [arXiv'22] [Paper] [Code]
- Residual adapters for parameter-efficient asr adaptation to atypical and accented speech. [arXiv'21] [Paper]
- Make your pre-trained model reversible: From parameter to memory efficient fine-tuning. [arXiv'23] [Paper] [Code]
- Pema: Plug-in external memory adaptation for language models. [arXiv'23] [Paper]
- The power of scale for parameter-efficient prompt tuning. [arXiv'21] [Paper]
- Attempt: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. [EMNLP'22] [Paper] [Code]
- Mprompt: Exploring multi-level prompt tuning for machine reading comprehension. [arXiv'23] [Paper] [Code]
- Bioinstruct: Instruction tuning of large language models for biomedical natural language processing. [arXiv'23] [Paper] [Code]
- Decomposed prompt tuning via low-rank reparameterization. [arXiv'23] [Paper] [Code]
- A dual prompt learning framework for few-shot dialogue state tracking. [WWW'23] [Paper] [Code]
- User-aware prefix-tuning is a good learner for personalized image captioning. [arXiv'23] [Paper]
- Prefix-diffusion: A lightweight diffusion model for diverse image captioning. [arXiv'23] [Paper]
- Domain aligned prefix averaging for domain generalization in abstractive summarization. [arXiv'23] [Paper] [Code]
- Prefix propagation: Parameter-efficient tuning for long sequences. [arXiv'23] [Paper] [Code]
- Pip: Parse-instructed prefix for syntactically controlled paraphrase generation. [arXiv'23] [Paper] [Code]
- Towards building the federated gpt: Federated instruction tuning. [arXiv'23] [Paper] [Code]
- Domain-oriented prefix-tuning: Towards efficient and generalizable fine-tuning for zero-shot dialogue summarization. [arXiv'23] [Paper]
- Unified low-resource sequence labeling by sample-aware dynamic sparse finetuning. [arXiv'23] [Paper] [Code]
- On the effectiveness of parameter-efficient fine-tuning. [AAAI'23] [Paper] [Code]
- Sensitivity-aware visual parameter-efficient fine-tuning. [ICCV'23] [Paper] [Code]
- VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control. [ICCV'23] [Paper] [Code] [Project]
- Smartfrz: An efficient training framework using attention-based layer freezing. [ICLR'23] [Paper]
- Token mixing: parameter-efficient transfer learning from image-language to video-language. [AAAI'23] [Paper] [Code]
- One-for-all: Generalized lora for parameter-efficient fine-tuning. [arXiv'23] [Paper] [Code]
- Dsee: Dually sparsity-embedded efficient tuning of pre-trained language models. [arXiv'21] [Paper] [Code]
- Longlora: Efficient fine-tuning of long-context large language models. [arXiv'23] [Paper] [Code]
- Qlora: Efficient finetuning of quantized llms. [arXiv'23] [Paper] [Code]
- Pela: Learning parameter-efficient models with low-rank approximation. [arXiv'23] [Paper] [Code]
- Efficientdm: Efficient quantization-aware fine-tuning of low-bit diffusion models. [arXiv'23] [Paper]
- Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. [arXiv'23] [Paper]
- Loftq: Lora-fine-tuning-aware quantization for large language models. [arXiv'23] [Paper] [Code]
- Full parameter fine-tuning for large language models with limited resources. [arXiv'23] [Paper] [Code]
- Fine-tuning language models with just forward passes. [arXiv'23] [Paper] [Code]
- Efficient transformers with dynamic token pooling. [arXiv'23] [Paper] [Code]
- Qa-lora: Quantization-aware low-rank adaptation of large language models. [arXiv'23] [Paper] [Code]
- Efficient low-rank backpropagation for vision transformer adaptation. [arXiv'23] [Paper]
- Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices. [arXiv'23] [Paper]
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models. [arXiv'24] [Paper] [Code]
- DoRA: Weight-Decomposed Low-Rank Adaptation. [ICML'24] [Paper] [Code]
- LoRA+: Efficient Low Rank Adaptation of Large Models. [ICML'24] [Paper]
- Towards Green AI in Fine-Tuning Large Language Models via Adaptive Backpropagation. [ICLR'24] [Paper]
- Fast inference from transformers via speculative decoding. [ICML'23] [Paper] [Code]
- Accelerating Large Language Model Decoding with Speculative Sampling. [arXiv'23] [Paper] [Code]
- SpecTr: Fast Speculative Decoding via Optimal Transport. [NeurIPS'23] [Paper] [Code]
- ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. [EMNLP'20] [Paper] [Code]
- Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. [arXiv'23] [Paper] [Code]
- LLMCad: Fast and Scalable On-device Large Language Model Inference. [arXiv'23] [Paper]
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads. [URL]
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. [URL]
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification. [arXiv'23] [Paper] [Code]
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference. [arXiv'23] [Paper]
- Inference with Reference: Lossless Acceleration of Large Language Models. [arXiv'23] [Paper] [Code]
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding. [arXiv'23] [Paper] [Code]
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. [EMNLP'23] [Paper] [Code]
- Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models. [EMNLP'22] [Paper] [Code]
- EntropyRank: Unsupervised keyphrase extraction via side-information optimization for language model-based text compression. [ICML'23] [Paper]
- LLMZip: Lossless Text Compression using Large Language Models. [arXiv'23] [Paper] [Code]
- In-context Autoencoder for Context Compression in a Large Language Model. [arXiv'23] [Paper] [Code]
- Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models. [arXiv'23] [Paper]
- Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning. [arXiv'23] [Paper]
- PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. [ICML'20] [Paper] [Code]
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. [ACL'21] [Paper] [Code]
- TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. [NAACL'21] [Paper] [Code]
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. [NeurIPS'21] [Paper] [Code]
- AdaViT: Adaptive Vision Transformers for Efficient Image Recognition. [CVPR'21] [Paper]
- AdaViT: Adaptive Tokens for Efficient Vision Transformer. [CVPR'22] [Paper] [Code]
- SPViT: Enabling Faster Vision Transformers via Soft Token Pruning. [ECCV'22] [Paper] [Code]
- PuMer: Pruning and Merging Tokens for Efficient Vision Language Models. [ACL'23] [Paper] [Code]
- H2o: Heavy-hitter oracle for efficient generative inference of large language models. [NeurIPS'23] [Paper] [Code]
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers. [arXiv'23] [Paper] [Code]
- Landmark Attention: Random-Access Infinite Context Length for Transformers. [NeurIPS'23] [Paper] [Code]
- Train short, test long: Attention with linear biases enables input length extrapolation. [ICLR'22] [Paper] [Code]
- A Length-Extrapolatable Transformer. [ACL'22] [Paper] [Code]
- CLEX: Continuous Length Extrapolation for Large Language Models. [arXiv'23] [Paper] [Code]
- Extending Context Window of Large Language Models via Positional Interpolation. [arXiv'23] [Paper]
- YaRN: Efficient Context Window Extension of Large Language Models. [arXiv'23] [Paper] [Code]
- Functional Interpolation for Relative Positions Improves Long Context Transformers. [arXiv'23] [Paper]
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training. [arXiv'23] [Paper] [Code]
- Recurrent Memory Transformer. [NeurIPS'22] [Paper] [Code]
- Block-Recurrent Transformers. [NeurIPS'22] [Paper] [Code]
- Memformer: A Memory-Augmented Transformer for Sequence Modeling. [ACL'22] [Paper] [Code]
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models. [arXiv'23] [Paper] [Code]
- Efficient Streaming Language Models with Attention Sinks. [arXiv'23] [Paper] [Code]
- Parallel context windows for large language models. [ACL'23] [Paper] [Code]
- LongNet: Scaling Transformers to 1,000,000,000 Tokens. [arXiv'23] [Paper] [Code]
- Efficient Long-Text Understanding with Short-Text Models. [TACL'23] [Paper] [Code]
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. [arXiv'24] [Paper] [Code]
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images. [ECCV'24] [Paper] [Code]
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. [arXiv'24] [Paper] [Code]
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. [ICLR'24] [Paper]
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. [arXiv'24] [Paper] [Code]
- LLM as a System Service on Mobile Devices. [arXiv'24] [Paper]
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. [arXiv'24] [Paper] [Code]
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. [SIGCOMM'24] [Paper] [Code]
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. [OSDI'24] [Paper]
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. [ATC'24] [Paper]
- From Dense to Sparse: Contrastive Pruning for Better Pre-Trained Language Model Compression. [AAAI'22] [Paper]
- Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models. [NeurIPS'22] [Paper] [Code]
- ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. [HPCA'23] [Paper] [Code]
- A Simple and Effective Pruning Approach for Large Language Models. [arXiv'23] [Paper] [Code]
- Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers. [ICLR'20] [Paper] [Code]
- UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. [ICML'23] [Paper] [Code]
- Sparsegpt: Massive language models can be accurately pruned in one-shot. [arXiv'23] [Paper] [Code]
- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models. [ICASSP'24] [Paper] [Code]
- BiT: Robustly Binarized Multi-distilled Transformer. [NeurIPS'22] [Paper] [Code]
- DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models. [ACL'23] [Paper] [Code]
- Block-Skim: Efficient Question Answering for Transformer. [AAAI'22] [Paper] [Code]
- Depgraph: Towards any structural pruning. [CVPR'23] [Paper] [Code]
- PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance. [ICML'22] [Paper] [Code]
- Differentiable joint pruning and quantization for hardware efficiency. [ECCV'20] [Paper]
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. [HPCA'21] [Paper] [Code]
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. [arXiv'23] [Paper] [Code]
- Accelerated sparse neural training: A provable and efficient method to find n: m transposable masks. [NeurIPS'21] [Paper] [Code]
- Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning. [arXiv'23] [Paper]
- What matters in the structured pruning of generative language models? [arXiv'23] [Paper]
- LLM-Pruner: On the Structural Pruning of Large Language Models. [NeurIPS'23] [Paper] [Code]
- Deja vu: Contextual sparsity for efficient llms at inference time. [ICML'23] [Paper] [Code]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. [arXiv'23] [Paper] [Code]
- Distilling Large Vision-Language Model with Out-of-Distribution Generalizability. [ICCV'23] [Paper] [Code]
- DIME-FM: DIstilling Multimodal and Efficient Foundation Models. [ICCV'23] [Paper] [Code]
- MixKD: Towards Efficient Distillation of Large-scale Language Models. [arXiv'20] [Paper]
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression. [arXiv'22] [Paper] [Code]
- DISTILLM: Towards Streamlined Distillation for Large Language Models. [arXiv'24] [Paper] [Code]
- Propagating Knowledge Updates to LMs Through Distillation. [arXiv'23] [Paper]
- GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models. [arXiv'23] [Paper] [Code]
- Knowledge Distillation of Large Language Models. [arXiv'23] [Paper]
- Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. [ACL'23] [Paper] [Code]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. [ACL'23] [Paper] [Code]
- Teaching Small Language Models to Reason. [ACL'22] [Paper]
- Explanations from Large Language Models Make Small Reasoners Better. [arXiv'22] [Paper]
- Lion: Adversarial distillation of closed-source large language model. [arXiv'23] [Paper] [Code]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. [arXiv'23] [Paper] [Code]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. [arXiv'22] [Paper] [Code]
- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. [arXiv'22] [Paper]
- Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. [NeurIPS'22] [Paper] [Code]
- GPTQ: accurate post-training quantization for generative pre-trained transformers. [ICLR'23] [Paper] [Code]
- Few-bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction. [ICML'22] [Paper] [Code]
- SqueezeLLM: Dense-and-Sparse Quantization. [arXiv'23] [Paper] [Code]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. [arXiv'23] [Paper] [Code]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. [arXiv'23] [Paper] [Code]
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees. [NeurIPS'23] [Paper] [Code]
- OWQ: Lessons learned from activation outliers for weight quantization in large language models. [arXiv'23] [Paper] [Code]
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs. [arXiv'23] [Paper]
- BinaryBERT: Pushing the Limit of BERT Quantization. [ACL'21] [Paper] [Code]
- I-BERT: Integer-only BERT Quantization. [ICML'21] [Paper] [Code]
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models. [NeurIPS'22] [Paper] [Code]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. [ICML'23] [Paper] [Code]
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. [NeurIPS'22] [Paper] [Code]
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. [NeurIPS'22] [Paper] [Code]
- RPTQ: Reorder-based Post-training Quantization for Large Language Models. [arXiv'23] [Paper] [Code]
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. [ACL'23] [Paper] [Code]
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. [arXiv'23] [Paper]
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. [arXiv'23] [Paper]
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference. [ICCV'23] [Paper] [Code]
- Q-Diffusion: Quantizing Diffusion Models. [ICCV'23] [Paper] [Code]
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization. [ISCA'23] [Paper]
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models. [arXiv'23] [Paper]
- Integer or floating point? New outlooks for low-bit quantization on large language models. [arXiv'23] [Paper]
- Oscillation-free Quantization for Low-bit Vision Transformers. [ICML'23] [Paper] [Code]
- FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization. [ICML'23] [Paper]
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. [arXiv'23] [Paper] [Code]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. [arXiv'23] [Paper] [Code]
- Compression of generative pre-trained language models via quantization. [ACL'22] [Paper]
- BitNet: Scaling 1-bit Transformers for Large Language Models. [arXiv'23] [Paper] [Code]
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers. [arXiv'23] [Paper]
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. [arXiv'23] [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization. [NeurIPS'23] [Paper] [Code]
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition. [arXiv'23] [Paper]
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression. [arXiv'23] [Paper]
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with Linear Taylor Attention. [HPCA'23] [Paper] [Code]
- SpinQuant: LLM Quantization with Learned Rotations. [arXiv'24] [Paper]
- QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. [arXiv'24] [Paper]
- I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models. [arXiv'24] [Paper]
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models. [arXiv'24] [Paper] [Code]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. [arXiv'24] [Paper] [Code]
- Achieving Sparse Activation in Small Language Models. [arXiv'24] [Paper] [Code]
- Optimizing Dynamic Neural Networks with Brainstorm. [OSDI'23] [Paper] [Code]
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. [SOSP'23] [Paper]
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. [SOSP'23] [Paper] [Code]
- Varuna: Scalable, Low-cost Training of Massive Deep Learning Models. [EuroSys'22] [Paper] [Code]
- HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. [ATC'20] [Paper]
- ZeRO-Offload: Democratizing Billion-Scale Model Training. [ATC'21] [Paper] [Code]
- Whale: Efficient Giant Model Training over Heterogeneous GPUs. [ATC'22] [Paper] [Code]
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. [ATC'23] [Paper] [Code]
- Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs. [FAST'21] [Paper]
- FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks. [FAST'21] [Paper] [Code]
- Sequence Parallelism: Long Sequence Training from System Perspective. [ACL'23] [Paper] [Code]
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models. [ASPLOS'23] [Paper]
- Mobius: Fine Tuning Large-scale Models on Commodity GPU Servers. [ASPLOS'23] [Paper]
- Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. [ASPLOS'23] [Paper] [Code]
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads. [ASPLOS'22] [Paper] [Code]
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. [SIGMOD'23] [Paper] [Code]
- On Optimizing the Communication of Model Parallelism. [MLSys'23] [Paper] [Code]
- Reducing Activation Recomputation in Large Transformer Models. [MLSys'23] [Paper] [Code]
- PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices. [MLSys'23] [Paper] [Code]
- Breadth-First Pipeline Parallelism. [MLSys'23] [Paper]
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts. [MLSys'23] [Paper] [Code]
- Tutel: Adaptive Mixture-of-Experts at Scale. [MLSys'23] [Paper] [Code]
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. [ICLR'20] [Paper] [Code]
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. [VLDB'23] [Paper] [Code]
- MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud. [VLDB'23] [Paper]
- Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. [ATC'21] [Paper] [Code]
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. [NSDI'23] [Paper] [Code]
- Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models. [SIGCOMM'23] [Paper]
- MLaaS in the wild: Workload analysis and scheduling in Large-Scale heterogeneous GPU clusters. [NSDI'22] [Paper] [Code]
- ZeRO: Memory optimizations toward training trillion parameter models. [SC'20] [Paper] [Code]
- Efficient large-scale language model training on gpu clusters using megatron-lm. [SC'21] [Paper] [Code]
- Alpa: Automating inter-and Intra-Operator parallelism for distributed deep learning. [OSDI'22] [Paper] [Code]
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. [ICPP'23] [Paper] [Code]
- Megatron-lm: Training multi-billion parameter language models using model parallelism. [arXiv'19] [Paper] [Code]
- Pytorch FSDP: experiences on scaling fully sharded data parallel. [arXiv'23] [Paper] [Code]
- DeepSpeed. [URL]
- Huggingface PEFT. [URL]
- FairScale. [URL]
- OpenLLM: Operating LLMs in production. [URL]
- Flower: A friendly federated learning research framework. [arXiv'20] [Paper] [Code]
- Fedml: A research library and benchmark for federated machine learning. [arXiv'20] [Paper] [Code]
- FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks. [NAACL'22] [Paper] [Code]
- FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models. [arXiv'23] [Paper] [Code]
- Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning. [arXiv'23] [Paper] [Code]
- Federated Self-supervised Speech Representations: Are We There Yet?. [arXiv'23] [Paper]
- Towards Building the Federated GPT: Federated Instruction Tuning. [arXiv'23] [Paper] [Code]
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly. [arXiv'23] [Paper]
- Privacy-Preserving Fine-Tuning of Artificial Intelligence (AI) Foundation Models with Federated Learning, Differential Privacy, Offsite Tuning, and Parameter-Efficient Fine-Tuning (PEFT). [TechRxiv'23] [Paper]
- Efficient federated learning for modern nlp. [MobiCom'23] [Paper] [Code]
- Federated few-shot learning for mobile NLP. [MobiCom'23] [Paper] [Code]
- Low-parameter federated learning with large language models. [arXiv'23] [Paper] [Code]
- FedPrompt: Communication-Efficient and Privacy-Preserving Prompt Tuning in Federated Learning. [ICASSP'23] [Paper]
- Reducing Communication Overhead in Federated Learning for Pre-trained Language Models Using Parameter-Efficient Finetuning. [Conference on Lifelong Learning Agents'23] [Paper]
- FEDBFPT: An efficient federated learning framework for BERT further pre-training. [AAAI'23] [Paper] [Code]
- FedOBD: Opportunistic Block Dropout for Efficiently Training Large-scale Neural Networks through Federated Learning. [arXiv'22] [Paper]
- FedBERT: When federated learning meets pre-training. [TIST'22] [Paper] [Code]
- FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning. [CVPR'23] [Paper] [Code]
- Federated fine-tuning of billion-sized language models across mobile devices. [arXiv'23] [Paper] [Code]
- Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models. [arXiv'23] [Paper]
- Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes. [arXiv'23] [Paper]
- Orca: A Distributed Serving System for Transformer-Based Generative Models. [OSDI'22] [Paper]
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. [arXiv'23] [Paper]
- Fast Distributed Inference Serving for Large Language Models. [arXiv'23] [Paper]
- FlexGen: high-throughput generative inference of large language models with a single GPU. [ICML'23] [Paper] [Code]
- DeepSpeed-FastGen. [URL]
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting. [arXiv'23] [Paper]
- Efficiently Scaling Transformer Inference. [MLSys'23] [Paper]
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. [SC'22] [Paper]
- FlashDecoding++: Faster Large Language Model Inference on GPUs. [arXiv'23] [Paper]
- Flash-Decoding for long-context inference. [URL]
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. [IPDPS'23] [Paper] [Code]
- SpotServe: Serving Generative Large Language Models on Preemptible Instances. [ASPLOS'24] [Paper] [Code]
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment. [arXiv'23] [Paper] [Code]
- Punica: Multi-Tenant LoRA Serving. [arXiv'23] [Paper] [Code]
- SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models. [arXiv'23] [Paper] [Code]
- Efficient Memory Management for Large Language Model Serving with PagedAttention. [SOSP'23] [Paper] [Code]
- Efficiently Programming Large Language Models using SGLang. [arXiv'23] [Paper]
- Batched Low-Rank Adaptation of Foundation Models. [ICLR'24] [Paper]
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. [OSDI'24] [Paper] [Code]
- EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge. [SenSys'23] [Paper]
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. [arXiv'23] [Paper]
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping. [arXiv'23] [Paper]
- LLMCad: Fast and Scalable On-device Large Language Model Inference. [arXiv'23] [Paper]
- STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. [ASPLOS'23] [Paper]
- Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization. [MLSys'23] [Paper]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. [arXiv'23] [Paper] [Code]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. [arXiv'24] [Paper] [Code]
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory. [arXiv'23] [Paper]
- On-Device Language Models: A Comprehensive Review. [arXiv'24] [Paper]
- LLM as a System Service on Mobile Devices. [arXiv'24] [Paper]
- ELMS: Elasticized Large Language Models On Mobile Devices. [arXiv'24] [Paper]