Awesome-LLMs-on-device

Awesome LLMs on Device: A Comprehensive Survey

Stars: 747

Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.

README:

🚀 Awesome LLMs on Device: A Must-Read Comprehensive Hub by Nexa AI

On-device Model Hub / Nexa SDK Documentation

Summary of On-device LLMs’ Evolution

🌟 About This Hub

🚀 Why This Hub is a Must-Read

📊 Comprehensive overview of on-device LLM evolution with easy-to-understand visualizations
🧠 In-depth analysis of groundbreaking architectures and optimization techniques
📱 Curated list of state-of-the-art models and frameworks ready for on-device deployment
💡 Practical examples and case studies to inspire your next project
🔄 Regular updates to keep you at the forefront of rapid advancements in the field
🤝 Active community of researchers and practitioners sharing insights and experiences

📚 What's Inside Our Hub

Awesome LLMs on Device: A Comprehensive Survey
Contents
Tutorials and Learning Resources
Citation

Foundations and Preliminaries

Evolution of On-Device LLMs

TinyLlama: An Open-Source Small Language Model
arXiv 2024 [Paper] [Github]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
arXiv 2024 [Paper] [Github]
MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases
arXiv 2024 [Paper]
Octopus series papers
arXiv 2024 [Octopus] [Octopus v2] [Octopus v3] [Octopus v4] [Github]
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arXiv 2024 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]

LLM Architecture Foundations

The case for 4-bit precision: k-bit inference scaling laws
ICML 2023 [Paper]
Challenges and applications of large language models
arXiv 2023 [Paper]
MiniLLM: Knowledge distillation of large language models
ICLR 2024 [Paper] [Github]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023 [Paper] [Github]
GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale
NeurIPS 2022 [Paper]

On-Device LLMs Training

OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
ICML 2024 [Paper] [Github]

Limitations of Cloud-Based LLM Inference and Advantages of On-Device Inference

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
arXiv 2024 [Paper]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
arXiv 2024 [Paper]
Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
AAAI 2024 [Paper]
Matrix compression via randomized low rank and low precision factorization
NeurIPS 2023 [Paper] [Github]

The Performance Indicator of On-Device LLMs

MNN: A lightweight deep neural network inference engine
2024 [Github]
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
arXiv 2024 [Paper] [Github]
llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup
2023 [Github]
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
arXiv 2023 [Paper] [Github]
mllm: Fast and lightweight multimodal LLM inference engine for mobile and edge devices
2023 [Github]

Efficient Architectures for On-Device LLMs

Model	Performance	Computational Efficiency	Memory Requirements
MobileLLM	High accuracy, optimized for sub-billion parameter models	Embedding sharing, grouped-query attention	Reduced model size due to deep and thin structures
EdgeShard	Up to 50% latency reduction, 2× throughput improvement	Collaborative edge-cloud computing, optimal shard placement	Distributed model components reduce individual device load
LLMCad	Up to 9.3× speedup in token generation	Generate-then-verify, token tree generation	Smaller LLM for token generation, larger LLM for verification
Any-Precision LLM	Supports multiple precisions efficiently	Post-training quantization, memory-efficient design	Substantial memory savings with versatile model precisions
Breakthrough Memory	Up to 4.5× performance improvement	PIM and PNM technologies enhance memory processing	Enhanced memory bandwidth and capacity
MELTing Point	Provides systematic performance evaluation	Analyzes impacts of quantization, efficient model evaluation	Evaluates memory and computational efficiency trade-offs
LLMaaS on device	Reduces context switching latency significantly	Stateful execution, fine-grained KV cache compression	Efficient memory management with tolerance-aware compression and swapping
LocMoE	Reduces training time per epoch by up to 22.24%	Orthogonal gating weights, locality-based expert regularization	Minimizes communication overhead with group-wise All-to-All and recompute pipeline
EdgeMoE	Significant performance improvements on edge devices	Expert-wise bitwidth adaptation, preloading experts	Efficient memory management through expert-by-expert computation reordering
JetMoE	Outperforms Llama2-7B and Llama2-13B-Chat with fewer parameters	Reduces inference computation by 70% using sparse activation	8B total parameters, only 2B activated per input token
PanGu-π Pro	Neural architecture, parameter initialization, and optimization strategy for billion-level parameter models	Embedding sharing, tokenizer compression	Reduced model size via architecture tweaking

Model Compression and Parameter Sharing

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
arXiv 2024 [Paper] [Github]

Collaborative and Hierarchical Model Approaches

EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
arXiv 2024 [Paper]
LLMCad: Fast and Scalable On-device Large Language Model Inference
arXiv 2023 [Paper]
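
The generate-then-verify pattern that the table above attributes to LLMCad (a smaller model drafts tokens, a larger model checks them) can be illustrated with a toy sketch. This is not LLMCad's implementation: it uses a single linear draft rather than a token tree, assumes greedy decoding, and the names `generate_then_verify`, `draft_next`, and `target_next` are hypothetical. A real system would verify the whole draft in one batched forward pass of the large model; here the target is called once per position for clarity.

```python
def generate_then_verify(prompt, draft_next, target_next, num_new_tokens, draft_len=4):
    """Toy greedy generate-then-verify loop (hypothetical helper).

    draft_next / target_next: callables mapping a token list to the next token.
    The output always equals what target_next alone would produce greedily.
    """
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1) The small model drafts a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2) The large model checks the draft position by position.
        for drafted in draft:
            expected = target_next(tokens)
            tokens.append(expected)  # always keep the large model's token
            # Stop at the first disagreement; the rest of the draft is discarded.
            if expected != drafted or len(tokens) >= goal:
                break
    return tokens
```

Because every emitted token is the one the large model would have chosen, a poor draft model only costs speed, never output quality. For example, with a toy target such as `lambda toks: (toks[-1] + 1) % 10`, any draft function yields the same sequence as the target alone.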

Memory and Computational Efficiency

The Breakthrough Memory Solutions for Improved Performance on LLM Inference
IEEE Micro 2024 [Paper]
MELTing point: Mobile Evaluation of Language Transformers
arXiv 2024 [Paper] [Github]

Mixture-of-Experts (MoE) Architectures

LLM as a system service on mobile devices
arXiv 2024 [Paper]
LocMoE: A Low-overhead MoE for Large Language Model Training
arXiv 2024 [Paper]
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
arXiv 2023 [Paper]
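
Sparse activation, the idea behind the MoE entries in the table above (for example, only a fraction of parameters being active per token), can be sketched with generic top-k routing. This is a minimal illustration in pure Python, not the routing scheme of EdgeMoE, LocMoE, or JetMoE; the function names and the use of plain callables as "experts" are assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the router probabilities renormalized over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs.

    Experts that are not selected are never called, which is where the
    compute saving of sparse activation comes from.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
```

With four experts and `k=2`, only two expert functions run per input, regardless of how many experts exist in total.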

General Efficiency and Performance Improvements

Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
arXiv 2024 [Paper] [Github]
On the Viability of Using LLMs for SW/HW Co-Design: An Example in Designing CiM DNN Accelerators
IEEE SOCC 2023 [Paper]

Model Compression and Optimization Techniques for On-Device LLMs

Quantization

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
arXiv 2024 [Paper]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
arXiv 2023 [Paper] [Github]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023 [Paper] [Github]
GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale
NeurIPS 2022 [Paper]
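
The papers in this list all store weights in fewer bits. As a point of reference, the simplest baseline is symmetric round-to-nearest ("absmax") quantization, sketched below in pure Python. This is deliberately not the method of AWQ, GPTQ, or the 1.58-bit paper, which are more sophisticated; the function names are hypothetical, and `weight_memory_gb` counts weights only (no activations, KV cache, or quantization metadata).

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization of a list of floats.

    Returns the integer codes and the single scale needed to decode them.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def weight_memory_gb(num_params, bits):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9
```

Under these assumptions, a 7-billion-parameter model needs about 14 GB for 16-bit weights and about 3.5 GB for 4-bit weights, which is the gap that makes quantization central to on-device deployment.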

Pruning

Challenges and applications of large language models
arXiv 2023 [Paper]

Knowledge Distillation

MiniLLM: Knowledge distillation of large language models
ICLR 2024 [Paper]
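
Knowledge distillation trains a small student model to match a larger teacher's output distribution. The sketch below computes the two divergences usually discussed: forward KL (teacher to student), the conventional distillation loss, and reverse KL (student to teacher), which is the objective MiniLLM argues for. It is a per-token illustration in pure Python only, not MiniLLM's training procedure; the function names, the temperature default, and the `eps` smoothing term are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_losses(teacher_logits, student_logits, temperature=2.0):
    """Return (forward_kl, reverse_kl) for one token position."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    forward_kl = kl_divergence(p_teacher, q_student)  # KL(teacher || student)
    reverse_kl = kl_divergence(q_student, p_teacher)  # KL(student || teacher)
    return forward_kl, reverse_kl
```

Both values are zero when the student reproduces the teacher's distribution and grow as the two diverge; they generally differ from each other because KL divergence is not symmetric.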

Low-Rank Factorization

Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
AAAI 2024 [Paper]
Matrix compression via randomized low rank and low precision factorization
NeurIPS 2023 [Paper] [Github]
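
Low-rank factorization replaces an m × n weight matrix W with a product A · B, where A is m × r and B is r × n, so storage drops from m·n to r·(m + n) values when r is small. The sketch below only does this bookkeeping and shows the shape of the product; it is not the algorithm of either paper above, and all function names are hypothetical.

```python
def dense_params(m, n):
    """Number of values in a dense m x n weight matrix."""
    return m * n

def low_rank_params(m, n, r):
    """Number of values in the factors A (m x r) and B (r x n)."""
    return r * (m + n)

def compression_ratio(m, n, r):
    """How many times smaller the factored form is than the dense matrix."""
    return dense_params(m, n) / low_rank_params(m, n, r)

def matmul(a, b):
    """Plain-Python matrix product, used to reconstruct W ~= A @ B."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]
```

For a 4096 × 4096 layer at rank 64, the factors hold 524,288 values instead of 16,777,216, a 32× reduction. The factored form is only smaller when r < m·n / (m + n), and how much accuracy survives depends on how close W actually is to low rank.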

Hardware Acceleration and Deployment Strategies

Popular On-Device LLMs Framework

llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup. [Github]
MNN: A blazing fast, lightweight deep learning framework. [Github]
PowerInfer: A CPU/GPU LLM inference engine that leverages activation locality on consumer-grade hardware. [Github]
ExecuTorch: A platform for On-device AI across mobile, embedded and edge for PyTorch. [Github]
MediaPipe: A suite of tools and libraries that enables quick application of AI and ML techniques. [Github]
MLC-LLM: A machine learning compiler and high-performance deployment engine for large language models. [Github]
vLLM: A fast and easy-to-use library for LLM inference and serving. [Github]
OpenLLM: An open platform for operating large language models (LLMs) in production. [Github]

Hardware Acceleration

The Breakthrough Memory Solutions for Improved Performance on LLM Inference
IEEE Micro 2024 [Paper]
Aquabolt-XL: Samsung HBM2-PIM with in-memory processing for ML accelerators and beyond
IEEE Hot Chips 2021 [Paper]

Applications

Text Generation for Messaging: Gboard smart reply
Translation: LLMCad
Meeting Summarization
Healthcare application: BioMistral-7B, HuatuoGPT
Research Support
Companion Robot
Disability Support: Octopus v3, Talkback with Gemini Nano
Autonomous Vehicles: DriveVLM

Model Reference

Model	Institute	Paper
Gemini Nano	Google	Gemini: A Family of Highly Capable Multimodal Models
Octopus series model	Nexa AI	Octopus: On-device language model for function calling of software APIs; Octopus v2: On-device language model for super agent; Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent; Octopus v4: Graph of language models
OpenELM and Ferret-v2	Apple	OpenELM is a significant large language model integrated within iOS to enhance application functionalities. Ferret-v2 significantly improves upon its predecessor, introducing enhanced visual processing capabilities and an advanced training regimen.
Phi series	Microsoft	Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
MiniCPM	Tsinghua University	A GPT-4V Level Multimodal LLM on Your Phone
Gemma2-9B	Google	Gemma 2: Improving Open Language Models at a Practical Size
Qwen2-0.5B	Alibaba Group	Qwen Technical Report

Tutorials and Learning Resources

MIT: TinyML and Efficient Deep Learning Computing
Harvard: Machine Learning Systems
DeepLearning.AI: Introduction to On-Device AI

🤝 Join the On-Device LLM Revolution

We believe in the power of community! If you're passionate about on-device AI and want to contribute to this ever-growing knowledge hub, here's how you can get involved:

Fork the repository
Create a new branch for your brilliant additions
Make your updates and push your changes
Submit a pull request and become part of the on-device LLM movement

📖 Cite Our Work

If our hub fuels your research or powers your projects, we'd be thrilled if you could cite our paper here:

@article{xu2024device,
  title={On-Device Language Models: A Comprehensive Review},
  author={Xu, Jiajun and Li, Zhiyuan and Chen, Wei and Wang, Qun and Gao, Xin and Cai, Qi and Ling, Ziyuan},
  journal={arXiv preprint arXiv:2409.00088},
  year={2024}
}

📄 License

This project is open-source and available under the MIT License. See the LICENSE file for more details.

Don't just read about the future of AI – be part of it. Star this repo, spread the word, and let's push the boundaries of on-device LLMs together! 🚀🌟

For Tasks:

analyze model evolution, optimize model efficiency, deploy on edge devices, compress model parameters, accelerate model inference

For Jobs:

researcher, developer, data scientist, machine learning engineer, AI architect

Alternative AI tools for Awesome-LLMs-on-device

Similar Open Source Tools

Awesome_Mamba

Awesome Mamba is a curated collection of groundbreaking research papers and articles on Mamba Architecture, a pioneering framework in deep learning known for its selective state spaces and efficiency in processing complex data structures. The repository offers a comprehensive exploration of Mamba architecture through categorized research papers covering various domains like visual recognition, speech processing, remote sensing, video processing, activity recognition, image enhancement, medical imaging, reinforcement learning, natural language processing, 3D recognition, multi-modal understanding, time series analysis, graph neural networks, point cloud analysis, and tabular data handling.

GitHub stars: 125

ColossalAI

Colossal-AI is a deep learning system for large-scale parallel training. It provides a unified interface to scale sequential code of model training to distributed environments. Colossal-AI supports parallel training methods such as data, pipeline, tensor, and sequence parallelism and is integrated with heterogeneous training and zero redundancy optimizer.

GitHub stars: 40.3k

awesome-weather-models

A catalogue and categorization of AI-based weather forecasting models. This page provides a catalogue and categorization of AI-based weather forecasting models to enable discovery and comparison of different available model options. The weather models are categorized based on metadata found in the JSON schema specification. The table includes information such as the name of the weather model, the organization that developed it, operational data availability, open-source status, and links for further details.

GitHub stars: 99

Pai-Megatron-Patch

Pai-Megatron-Patch is a deep learning training toolkit that lets developers easily train LLMs and VLMs, and run inference with them, using the Megatron framework. As LLMs continue to develop, model structure and scale are evolving rapidly. Although these models can be conveniently trained with the Transformers or DeepSpeed training frameworks, the training efficiency is comparatively low, and the problem becomes more severe once the model scale exceeds 10 billion parameters. The primary objective of Pai-Megatron-Patch is to make effective use of GPU computational power for LLMs. The tool allows convenient training of commonly used LLMs with all the acceleration techniques provided by Megatron-LM.

GitHub stars: 923

awesome-MLSecOps

Awesome MLSecOps is a curated list of open-source tools, resources, and tutorials for MLSecOps (Machine Learning Security Operations). It includes a wide range of security tools and libraries for protecting machine learning models against adversarial attacks, as well as resources for AI security, data anonymization, model security, and more. The repository aims to provide a comprehensive collection of tools and information to help users secure their machine learning systems and infrastructure.

GitHub stars: 204

ReST-MCTS

ReST-MCTS is a reinforced self-training approach that integrates process reward guidance with tree search MCTS to collect higher-quality reasoning traces and per-step value for training policy and reward models. It eliminates the need for manual per-step annotation by estimating the probability of steps leading to correct answers. The inferred rewards refine the process reward model and aid in selecting high-quality traces for policy model self-training.

GitHub stars: 452

inference

Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can deploy and serve your own models or state-of-the-art built-in models using a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.

GitHub stars: 7.4k

SoM-LLaVA

SoM-LLaVA is a new data source and learning paradigm for Multimodal LLMs, empowering open-source Multimodal LLMs with Set-of-Mark prompting and improved visual reasoning ability. The repository provides a new dataset that is complementary to existing training sources, enhancing multimodal LLMs with Set-of-Mark prompting and improved general capacity. By adding 30k SoM data to the visual instruction tuning stage of LLaVA, the tool achieves 1% to 6% relative improvements on all benchmarks. Users can train SoM-LLaVA via command line and utilize the implementation to annotate COCO images with SoM. Additionally, the tool can be loaded in Huggingface for further usage.

GitHub stars: 92

nuitrack-sdk

Nuitrack™ is an ultimate 3D body tracking solution developed by 3DiVi Inc. It enables body motion analytics applications for virtually any widespread depth sensors and hardware platforms, supporting a wide range of applications from real-time gesture recognition on embedded platforms to large-scale multisensor analytical systems. Nuitrack provides highly-sophisticated 3D skeletal tracking, basic facial analysis, hand tracking, and gesture recognition APIs for UI control. It offers two skeletal tracking engines: classical for embedded hardware and AI for complex poses, providing a human-centric spatial understanding tool for natural and intelligent user engagement.

GitHub stars: 489

buffer-of-thought-llm

Buffer of Thoughts (BoT) is a thought-augmented reasoning framework designed to enhance the accuracy, efficiency, and robustness of large language models (LLMs). It introduces a meta-buffer to store high-level thought-templates distilled from problem-solving processes, enabling adaptive reasoning for efficient problem-solving. The framework includes a buffer-manager to dynamically update the meta-buffer, ensuring scalability and stability. BoT achieves significant performance improvements on reasoning-intensive tasks and demonstrates superior generalization ability and robustness while being cost-effective compared to other methods.

GitHub stars: 341

MiniCPM-V

MiniCPM-V is a series of end-side multimodal LLMs designed for vision-language understanding. The models take image and text inputs to provide high-quality text outputs. The series includes models like MiniCPM-Llama3-V 2.5 with 8B parameters surpassing proprietary models, and MiniCPM-V 2.0, a lighter model with 2B parameters. The models support over 30 languages, efficient deployment on end-side devices, and have strong OCR capabilities. They achieve state-of-the-art performance on various benchmarks and prevent hallucinations in text generation. The models can process high-resolution images efficiently and support multilingual capabilities.

GitHub stars: 8.2k

AIOS

AIOS, a Large Language Model (LLM) Agent operating system, embeds large language model into Operating Systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, maintain access control for agents, and provide a rich set of toolkits for LLM Agent developers.

GitHub stars: 4.0k

SimAI

SimAI is the industry's first full-stack, high-precision simulator for AI large-scale training. It provides detailed modeling and simulation of the entire LLM training process, encompassing framework, collective communication, network layers, and more. This comprehensive approach offers end-to-end performance data, enabling researchers to analyze training process details, evaluate time consumption of AI tasks under specific conditions, and assess performance gains from various algorithmic optimizations.

GitHub stars: 281

AReaL

AReaL (Ant Reasoning RL) is an open-source reinforcement learning system developed at the RL Lab, Ant Research. It is designed for training Large Reasoning Models (LRMs) in a fully open and inclusive manner. AReaL provides reproducible experiments for 1.5B and 7B LRMs, showcasing its scalability and performance across diverse computational budgets. The system follows an iterative training process to enhance model performance, with a focus on mathematical reasoning tasks. AReaL is equipped to adapt to different computational resource settings, enabling users to easily configure and launch training trials. Future plans include support for advanced models, optimizations for distributed training, and exploring research topics to enhance LRMs' reasoning capabilities.

GitHub stars: 538

fastRAG

fastRAG is a research framework designed to build and explore efficient retrieval-augmented generative models. It incorporates state-of-the-art Large Language Models (LLMs) and Information Retrieval to empower researchers and developers with a comprehensive tool-set for advancing retrieval augmented generation. The framework is optimized for Intel hardware, customizable, and includes key features such as optimized RAG pipelines, efficient components, and RAG-efficient components like ColBERT and Fusion-in-Decoder (FiD). fastRAG supports various unique components and backends for running LLMs, making it a versatile tool for research and development in the field of retrieval-augmented generation.

GitHub stars: 1.3k

For similar tasks

Chinese-Mixtral-8x7B

Chinese-Mixtral-8x7B is an open-source project based on Mistral's Mixtral-8x7B model for incremental pre-training of Chinese vocabulary, aiming to advance research on MoE models in the Chinese natural language processing community. The expanded vocabulary significantly improves the model's encoding and decoding efficiency for Chinese, and the model is pre-trained incrementally on a large-scale open-source corpus, enabling it with powerful Chinese generation and comprehension capabilities. The project includes a large model with expanded Chinese vocabulary and incremental pre-training code.

GitHub stars: 635

ZhiLight

ZhiLight is a highly optimized large language model (LLM) inference engine developed by Zhihu and ModelBest Inc. It accelerates the inference of models like Llama and its variants, especially on PCIe-based GPUs. ZhiLight offers significant performance advantages compared to mainstream open-source inference engines. It supports various features such as custom defined tensor and unified global memory management, optimized fused kernels, support for dynamic batch, flash attention prefill, prefix cache, and different quantization techniques like INT8, SmoothQuant, FP8, AWQ, and GPTQ. ZhiLight is compatible with OpenAI interface and provides high performance on mainstream NVIDIA GPUs with different model sizes and precisions.

GitHub stars: 832

AIInfra

AIInfra is an open-source project focused on AI infrastructure, specifically targeting large models in distributed clusters, distributed architecture, distributed training, and algorithms related to large models. The project aims to explore and study system design in artificial intelligence and deep learning, with a focus on the hardware and software stack for building AI large model systems. It provides a comprehensive curriculum covering topics such as AI chip principles, communication and storage, AI clusters, large model training, and inference, as well as algorithms for large models. The course is designed for undergraduate and graduate students, as well as professionals working with AI large model systems, to gain a deep understanding of AI computer system architecture and design.

GitHub stars: 1.5k

Awesome-LLM-Quantization

Awesome-LLM-Quantization is a curated list of resources related to quantization techniques for Large Language Models (LLMs). Quantization is a crucial step in deploying LLMs on resource-constrained devices, such as mobile phones or edge devices, by reducing the model's size and computational requirements.

GitHub stars: 137

For similar jobs

Awesome_Mamba

GitHub stars: 125

unilm

The 'unilm' repository is a collection of tools, models, and architectures for Foundation Models and General AI, focusing on tasks such as NLP, MT, Speech, Document AI, and Multimodal AI. It includes various pre-trained models, such as UniLM, InfoXLM, DeltaLM, MiniLM, AdaLM, BEiT, LayoutLM, WavLM, VALL-E, and more, designed for tasks like language understanding, generation, translation, vision, speech, and multimodal processing. The repository also features toolkits like s2s-ft for sequence-to-sequence fine-tuning and Aggressive Decoding for efficient sequence-to-sequence decoding. Additionally, it offers applications like TrOCR for OCR, LayoutReader for reading order detection, and XLM-T for multilingual NMT.

GitHub stars: 19.6k

llm-app-stack

LLM App Stack, also known as Emerging Architectures for LLM Applications, is a comprehensive list of available tools, projects, and vendors at each layer of the LLM app stack. It covers various categories such as Data Pipelines, Embedding Models, Vector Databases, Playgrounds, Orchestrators, APIs/Plugins, LLM Caches, Logging/Monitoring/Eval, Validators, LLM APIs (proprietary and open source), App Hosting Platforms, Cloud Providers, and Opinionated Clouds. The repository aims to provide a detailed overview of tools and projects for building, deploying, and maintaining enterprise data solutions, AI models, and applications.

GitHub stars: 929

awesome-deeplogic

Awesome deep logic is a curated list of papers and resources focusing on integrating symbolic logic into deep neural networks. It includes surveys, tutorials, and research papers that explore the intersection of logic and deep learning. The repository aims to provide valuable insights and knowledge on how logic can be used to enhance reasoning, knowledge regularization, weak supervision, and explainability in neural networks.

GitHub stars: 214

duo-attention

DuoAttention is a framework designed to optimize long-context large language models (LLMs) by reducing memory and latency during inference without compromising their long-context abilities. It introduces a concept of Retrieval Heads and Streaming Heads to efficiently manage attention across tokens. By applying a full Key and Value (KV) cache to retrieval heads and a lightweight, constant-length KV cache to streaming heads, DuoAttention achieves significant reductions in memory usage and decoding time for LLMs. The framework uses an optimization-based algorithm with synthetic data to accurately identify retrieval heads, enabling efficient inference with minimal accuracy loss compared to full attention. DuoAttention also supports quantization techniques for further memory optimization, allowing for decoding of up to 3.3 million tokens on a single GPU.

GitHub stars: 57

llm_note

LLM notes repository contains detailed analysis on transformer models, language model compression, inference and deployment, high-performance computing, and system optimization methods. It includes discussions on various algorithms, frameworks, and performance analysis related to large language models and high-performance computing. The repository serves as a comprehensive resource for understanding and optimizing language models and computing systems.

GitHub stars: 686

Awesome-Resource-Efficient-LLM-Papers

A curated list of high-quality papers on resource-efficient Large Language Models (LLMs) with a focus on various aspects such as architecture design, pre-training, fine-tuning, inference, system design, and evaluation metrics. The repository covers topics like efficient transformer architectures, non-transformer architectures, memory efficiency, data efficiency, model compression, dynamic acceleration, deployment optimization, support infrastructure, and other related systems. It also provides detailed information on computation metrics, memory metrics, energy metrics, financial cost metrics, network communication metrics, and other metrics relevant to resource-efficient LLMs. The repository includes benchmarks for evaluating the efficiency of NLP models and references for further reading.

GitHub stars: 105