Awesome-LLMs-on-device
Awesome LLMs on Device: A Comprehensive Survey
Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.
- Comprehensive overview of on-device LLM evolution with easy-to-understand visualizations
- In-depth analysis of groundbreaking architectures and optimization techniques
- Curated list of state-of-the-art models and frameworks ready for on-device deployment
- Practical examples and case studies to inspire your next project
- Regular updates to keep you at the forefront of rapid advancements in the field
- Active community of researchers and practitioners sharing insights and experiences
- TinyLlama: An Open-Source Small Language Model. arXiv 2024 [Paper] [Github]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model. arXiv 2024 [Paper]
- MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases. arXiv 2024 [Paper]
- Octopus series papers. arXiv 2024 [Octopus] [Octopus v2] [Octopus v3] [Octopus v4] [Github]
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv 2024 [Paper]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2023 [Paper] [Github]
- The Case for 4-bit Precision: k-bit Inference Scaling Laws. ICML 2023 [Paper]
- Challenges and Applications of Large Language Models. arXiv 2023 [Paper]
- MiniLLM: Knowledge Distillation of Large Language Models. ICLR 2024 [Paper] [Github]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 [Paper] [Github] (a quantization baseline sketch follows this list)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 [Paper]
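
Several of the quantization papers above (GPTQ, AWQ, LLM.int8()) improve on the same baseline: round-to-nearest weight quantization with a per-channel scale. Here is a minimal NumPy sketch of that baseline, for intuition only; the papers themselves add calibration, activation awareness, or error compensation on top.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel round-to-nearest INT8 quantization.
    Each row gets one FP scale; dequantize with q * scale."""
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", float(np.abs(w - q * scale).max()))
```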
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework. ICML 2024 [Paper] [Github]
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models. arXiv 2024 [Paper]
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024 [Paper]
- Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. AAAI 2024 [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization. NeurIPS 2023 [Paper] [Github]
- MNN: A lightweight deep neural network inference engine. 2024 [Github]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone. arXiv 2024 [Paper] [Github]
- llama.cpp: Lightweight library for efficient LLM inference on various hardware with minimal setup. 2023 [Github]
- PowerInfer: Fast Large Language Model Serving with a Consumer-Grade GPU. arXiv 2023 [Paper] [Github]
- mllm: Fast and lightweight multimodal LLM inference engine for mobile and edge devices. 2023 [Github]
Model | Performance | Computational Efficiency | Memory Requirements |
---|---|---|---|
MobileLLM | High accuracy, optimized for sub-billion parameter models | Embedding sharing, grouped-query attention | Reduced model size due to deep and thin structures |
EdgeShard | Up to 50% latency reduction, 2Γ throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load |
LLMCad | Up to 9.3Γ speedup in token generation | Generate-then-verify, token tree generation | Smaller LLM for token generation, larger LLM for verification |
Any-Precision LLM | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions |
Breakthrough Memory | Up to 4.5Γ performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity |
MELTing Point | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs |
LLMaaS on device | Reduces context switching latency significantly | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping |
LocMoE | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline |
EdgeMoE | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering |
JetMoE | Outperforms Llama2-7B and Llama2-13B-Chat with fewer parameters | Reduces inference computation by 70% using sparse activation | 8B total parameters, only 2B activated per input token |
Pangu-π Pro | Neural architecture, parameter initialization, and optimization strategy for billion-level parameter models | Embedding sharing, tokenizer compression | Reduced model size via architecture tweaking |
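
Grouped-query attention, which several of the models in the table lean on, cuts memory by letting many query heads share a few cached key/value heads. Below is a minimal NumPy sketch for intuition; the head counts, sequence length, and dimensions are illustrative assumptions, not taken from any one model.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal grouped-query attention: many query heads share a few
    key/value heads, shrinking the KV cache by the group factor."""
    n_q_heads, _, d = q.shape
    group = n_q_heads // k.shape[0]
    k = np.repeat(k, group, axis=0)              # broadcast each KV head to its query group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)        # softmax over key positions
    return probs @ v

q = np.random.randn(8, 16, 64)   # 8 query heads, 16 tokens, dim 64
k = np.random.randn(2, 16, 64)   # only 2 KV heads are cached
v = np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # (8, 16, 64)
```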
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2024 [Paper] [Github]
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. arXiv 2024 [Paper] [Github]
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing. arXiv 2024 [Paper]
- LLMCad: Fast and Scalable On-Device Large Language Model Inference. arXiv 2023 [Paper] (see the generate-then-verify sketch after this list)
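
LLMCad's generate-then-verify pipeline follows the broader speculative-decoding pattern: a small memory-resident model drafts tokens cheaply and a larger model arbitrates. Here is a toy Python sketch under strong simplifying assumptions; both models are reduced to greedy next-token functions, and a real system would verify all drafted positions in one batched forward pass.

```python
def generate_then_verify(draft_model, target_model, prompt, n_draft=4, max_new=32):
    """Toy generate-then-verify loop; draft_model/target_model each map a
    token sequence to a single greedy next token (stand-ins, not real LLMs)."""
    ids = list(prompt)
    while len(ids) - len(prompt) < max_new:
        # 1) The small draft model cheaply proposes a run of tokens.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_model(ids + draft))
        # 2) The large target model checks each drafted position in order.
        accepted = 0
        for i in range(len(draft)):
            if target_model(ids + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        ids += draft[:accepted]
        # 3) On the first mismatch, take the target model's own token instead.
        if accepted < len(draft):
            ids.append(target_model(ids))
    return ids

# Toy demo: the draft proposes successors mod 10; the target agrees only
# when the previous token is even.
draft_model = lambda ids: (ids[-1] + 1) % 10
target_model = lambda ids: (ids[-1] + 1) % 10 if ids[-1] % 2 == 0 else (ids[-1] + 2) % 10
print(generate_then_verify(draft_model, target_model, [0], max_new=8))
```

Because the large model only arbitrates, most tokens cost a cheap draft step plus a shared verification pass, which is where speedups of this kind come from.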
- The Breakthrough Memory Solutions for Improved Performance on LLM Inference. IEEE Micro 2024 [Paper]
- MELTing Point: Mobile Evaluation of Language Transformers. arXiv 2024 [Paper] [Github]
- LLM as a System Service on Mobile Devices. arXiv 2024 [Paper]
- LocMoE: A Low-Overhead MoE for Large Language Model Training. arXiv 2024 [Paper]
- EdgeMoE: Fast On-Device Inference of MoE-Based Large Language Models. arXiv 2023 [Paper] (see the gating sketch after this list)
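
LocMoE and EdgeMoE both exploit the sparsity of mixture-of-experts layers: only the top-k experts fire per token, so most expert weights can stay cold or be loaded expert-by-expert. A minimal NumPy sketch of top-k gating follows; the expert count, width, and linear gate here are illustrative assumptions.

```python
import numpy as np

def moe_layer(x, experts, w_gate, k=2):
    """Minimal sparse MoE forward pass: a linear gate scores all experts,
    only the top-k run, and their outputs are mixed by renormalized gates."""
    logits = x @ w_gate                        # (n_experts,)
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                         # softmax over the selected experts only
    return sum(g * experts[i](x) for g, i in zip(gate, top))

d, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
w_gate = rng.standard_normal((d, n_experts))
print(moe_layer(rng.standard_normal(d), experts, w_gate).shape)  # (16,)
```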
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. arXiv 2024 [Paper] [Github]
- On the Viability of Using LLMs for SW/HW Co-Design: An Example in Designing CIM DNN Accelerators. IEEE SOCC 2023 [Paper]
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. arXiv 2024 [Paper] (see the ternary sketch after this list)
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv 2024 [Paper] [Github]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 [Paper] [Github]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 [Paper]
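
The 1.58-bit paper constrains every weight to {-1, 0, +1} with a single absmean scale per tensor, so matrix multiplies reduce to additions and subtractions. Here is a small NumPy sketch of that quantizer as described in the paper; training-time details such as straight-through gradient estimation are omitted.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternary quantization in the style of the 1.58-bit LLM paper:
    map each weight to {-1, 0, +1} with one per-tensor scale."""
    scale = np.abs(w).mean()
    q = np.clip(np.round(w / (scale + eps)), -1, 1)
    return q, scale

w = np.random.randn(4, 8)
q, s = ternary_quantize(w)
print(np.unique(q), "scale:", round(float(s), 4))
```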
- Challenges and Applications of Large Language Models. arXiv 2023 [Paper]
- MiniLLM: Knowledge Distillation of Large Language Models. ICLR 2024 [Paper] (see the distillation-loss sketch below)
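
For orientation, classic logit distillation minimizes the KL divergence between temperature-softened teacher and student distributions; MiniLLM's specific contribution is arguing for the reverse KL in the generative setting. Below is a NumPy sketch of the standard forward-KL loss; swap the two distributions for the reverse direction.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Forward-KL logit distillation, KL(teacher || student), on
    temperature-softened distributions, scaled by T^2 as is conventional."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1)
    return float(kl.mean() * T * T)

teacher = np.random.randn(4, 32000)   # (batch, vocab) logits
student = np.random.randn(4, 32000)
print(distill_loss(student, teacher))
```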
- Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. AAAI 2024 [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization. NeurIPS 2023 [Paper] [Github] (see the SVD sketch after this list)
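
Both entries above build on the same primitive: truncated SVD, which replaces a weight matrix with two thin factors keeping only the top singular directions. A minimal NumPy sketch follows; the rank and matrix size are arbitrary choices for illustration.

```python
import numpy as np

def low_rank_compress(w, rank):
    """Truncated-SVD weight compression: store two thin factors a @ b
    instead of the full matrix w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]        # (out, rank)
    b = vt[:rank]                     # (rank, in)
    return a, b

w = np.random.randn(256, 256)
a, b = low_rank_compress(w, rank=32)
print("params:", w.size, "->", a.size + b.size,
      "rel err:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
```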
- llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup (see the usage sketch after this list). [Github]
- MNN: A blazing fast, lightweight deep learning framework. [Github]
- PowerInfer: A CPU/GPU LLM inference engine leveraging activation locality for on-device inference. [Github]
- ExecuTorch: A platform for on-device AI across mobile, embedded, and edge devices, for PyTorch. [Github]
- MediaPipe: A suite of tools and libraries that enables quick application of AI and ML techniques. [Github]
- MLC-LLM: A machine learning compiler and high-performance deployment engine for large language models. [Github]
- vLLM: A fast and easy-to-use library for LLM inference and serving. [Github]
- OpenLLM: An open platform for operating large language models (LLMs) in production. [Github]
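
As a concrete starting point for the frameworks above, here is a hedged sketch of driving llama.cpp from Python via the community llama-cpp-python bindings. The GGUF path and file name are placeholders, not artifacts shipped with this repository.

```python
# Minimal llama.cpp usage via llama-cpp-python (pip install llama-cpp-python).
# The model path below is a placeholder; any quantized GGUF model works.
from llama_cpp import Llama

llm = Llama(model_path="models/tinyllama-1.1b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: Name one benefit of on-device LLMs. A:", max_tokens=48)
print(out["choices"][0]["text"])
```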
- The Breakthrough Memory Solutions for Improved Performance on LLM Inference. IEEE Micro 2024 [Paper]
- Aquabolt-XL: Samsung HBM2-PIM with In-Memory Processing for ML Accelerators and Beyond. IEEE Hot Chips 2021 [Paper]
- Text generation for messaging: Gboard Smart Reply
- Translation: LLMCad
- Meeting summarization
- Healthcare applications: BioMistral-7B, HuatuoGPT
- Research support
- Companion robots
- Disability support: Octopus v3, TalkBack with Gemini Nano
- Autonomous vehicles: DriveVLM
Model | Institute | Paper |
---|---|---|
Gemini Nano | Google | Gemini: A Family of Highly Capable Multimodal Models |
Octopus series | Nexa AI | Octopus: On-device language model for function calling of software APIs; Octopus v2: On-device language model for super agent; Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent; Octopus v4: Graph of language models |
OpenELM and Ferret-v2 | Apple | OpenELM is an efficient language model family integrated within iOS to enhance application functionalities; Ferret-v2 improves substantially on its predecessor with enhanced visual processing capabilities and an advanced training regimen |
Phi series | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone |
Gemma2-9B | Google | Gemma 2: Improving Open Language Models at a Practical Size |
Qwen2-0.5B | Alibaba Group | Qwen Technical Report |
Tutorials and Learning Resources
- MIT: TinyML and Efficient Deep Learning Computing
- Harvard: Machine Learning Systems
- DeepLearning.AI: Introduction to On-Device AI
We believe in the power of community! If you're passionate about on-device AI and want to contribute to this ever-growing knowledge hub, here's how you can get involved:
- Fork the repository
- Create a new branch for your brilliant additions
- Make your updates and push your changes
- Submit a pull request and become part of the on-device LLM movement
If our hub fuels your research or powers your projects, we'd be thrilled if you could cite our paper here:
```bibtex
@article{xu2024device,
  title={On-Device Language Models: A Comprehensive Review},
  author={Xu, Jiajun and Li, Zhiyuan and Chen, Wei and Wang, Qun and Gao, Xin and Cai, Qi and Ling, Ziyuan},
  journal={arXiv preprint arXiv:2409.00088},
  year={2024}
}
```
This project is open-source and available under the MIT License. See the LICENSE file for more details.
Don't just read about the future of AI: be part of it. Star this repo, spread the word, and let's push the boundaries of on-device LLMs together!