Awesome-LLMs-on-device
Awesome LLMs on Device: A Comprehensive Survey
Welcome to the ultimate hub for on-device Large Language Models (LLMs)! This repository is your go-to resource for all things related to LLMs designed for on-device deployment. Whether you're a seasoned researcher, an innovative developer, or an enthusiastic learner, this comprehensive collection of cutting-edge knowledge is your gateway to understanding, leveraging, and contributing to the exciting world of on-device LLMs.
- 📊 Comprehensive overview of on-device LLM evolution with easy-to-understand visualizations
- 🧠 In-depth analysis of groundbreaking architectures and optimization techniques
- 📱 Curated list of state-of-the-art models and frameworks ready for on-device deployment
- 💡 Practical examples and case studies to inspire your next project
- 🔄 Regular updates to keep you at the forefront of rapid advancements in the field
- 🤝 Active community of researchers and practitioners sharing insights and experiences
- Awesome LLMs on Device: A Comprehensive Survey
- Contents
- Tutorials and Learning Resources
- Citation
- TinyLlama: An Open-Source Small Language Model, arXiv 2024 [Paper] [Github]
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model, arXiv 2024 [Paper] [Github]
- MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, arXiv 2024 [Paper]
- Octopus series papers, arXiv 2024 [Octopus] [Octopus v2] [Octopus v3] [Octopus v4] [Github]
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, arXiv 2024 [Paper]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv 2023 [Paper] [Github]
- The Case for 4-bit Precision: k-bit Inference Scaling Laws, ICML 2023 [Paper]
- Challenges and Applications of Large Language Models, arXiv 2023 [Paper]
- MiniLLM: Knowledge Distillation of Large Language Models, ICLR 2024 [Paper] [Github]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, ICLR 2023 [Paper] [Github]
- GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022 [Paper] (a toy int8 matmul sketch follows this list)
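Several of the quantization papers above share one core move: cast weights (and sometimes activations) to low-bit integers and do the matrix multiply in integer arithmetic. Below is a toy illustration of that idea, an absmax-scaled int8 matmul in NumPy; it deliberately omits the outlier-aware mixed-precision decomposition that GPT3.int8() adds on top.

```python
import numpy as np

def int8_absmax_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Quantize both operands to int8 with per-tensor absmax scales,
    multiply in integer arithmetic, then rescale back to float."""
    sx = np.abs(X).max() / 127.0          # activation scale
    sw = np.abs(W).max() / 127.0          # weight scale
    Xq = np.round(X / sx).astype(np.int8)
    Wq = np.round(W / sw).astype(np.int8)
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)  # int32 accumulator avoids overflow
    return acc.astype(np.float32) * (sx * sw)

X = np.random.randn(4, 64).astype(np.float32)
W = np.random.randn(64, 32).astype(np.float32)
err = np.abs(X @ W - int8_absmax_matmul(X, W)).mean()
print(f"mean abs error vs fp32: {err:.4f}")
```

Per-tensor absmax scaling like this is exactly where activation outliers cause trouble at scale, which is the failure mode GPT3.int8() targets with its mixed-precision decomposition.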
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework, ICML 2024 [Paper] [Github]
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models, arXiv 2024 [Paper]
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, arXiv 2024 [Paper]
- Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, AAAI 2024 [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization, NeurIPS 2023 [Paper] [Github]
- MNN: A lightweight deep neural network inference engine, 2024 [Github]
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone, arXiv 2024 [Paper] [Github]
- llama.cpp: A lightweight C/C++ library for efficient LLM inference on a wide range of hardware, 2023 [Github]
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, arXiv 2023 [Paper] [Github]
- mllm: Fast and lightweight multimodal LLM inference engine for mobile and edge devices, 2023 [Github]
Model | Performance | Computational Efficiency | Memory Requirements |
---|---|---|---|
MobileLLM | High accuracy, optimized for sub-billion parameter models | Embedding sharing, grouped-query attention | Reduced model size due to deep and thin structures |
EdgeShard | Up to 50% latency reduction, 2× throughput improvement | Collaborative edge-cloud computing, optimal shard placement | Distributed model components reduce individual device load |
LLMCad | Up to 9.3× speedup in token generation | Generate-then-verify, token tree generation (see the sketch after this table) | Smaller LLM for token generation, larger LLM for verification |
Any-Precision LLM | Supports multiple precisions efficiently | Post-training quantization, memory-efficient design | Substantial memory savings with versatile model precisions |
Breakthrough Memory | Up to 4.5× performance improvement | PIM and PNM technologies enhance memory processing | Enhanced memory bandwidth and capacity |
MELTing Point | Provides systematic performance evaluation | Analyzes impacts of quantization, efficient model evaluation | Evaluates memory and computational efficiency trade-offs |
LLMaaS on device | Reduces context-switching latency significantly | Stateful execution, fine-grained KV cache compression | Efficient memory management with tolerance-aware compression and swapping |
LocMoE | Reduces training time per epoch by up to 22.24% | Orthogonal gating weights, locality-based expert regularization | Minimizes communication overhead with group-wise All-to-All and recompute pipeline |
EdgeMoE | Significant performance improvements on edge devices | Expert-wise bitwidth adaptation, preloading experts | Efficient memory management through expert-by-expert computation reordering |
JetMoE | Outperforms Llama2-7B and Llama2-13B-Chat with fewer parameters | Reduces inference computation by 70% using sparse activation | 8B total parameters, only 2B activated per input token |
Pangu-π Pro | Neural architecture, parameter initialization, and optimization strategy for billion-level parameter models | Embedding sharing, tokenizer compression | Reduced model size via architecture tweaking |
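The generate-then-verify pattern in the LLMCad row is easy to miss in a table cell, so here is a minimal greedy sketch of the idea: a small on-device model drafts a few tokens cheaply, and a larger model checks the whole draft in one forward pass, keeping the longest agreeing prefix. The `draft_next` and `target_argmax` callables are hypothetical stand-ins for real models, and this omits LLMCad's token-tree generation.

```python
from typing import Callable, List

def generate_then_verify(
    prompt: List[int],                                # assumed non-empty
    draft_next: Callable[[List[int]], int],           # small model: greedy next token
    target_argmax: Callable[[List[int]], List[int]],  # large model: greedy token after every prefix
    n_draft: int = 4,
    max_new: int = 32,
) -> List[int]:
    """Toy greedy speculative decoding: the draft model proposes a short
    run, the target model verifies the whole run in one pass."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        base = len(tokens)
        # 1) Draft model proposes n_draft tokens autoregressively (cheap).
        ctx, proposal = list(tokens), []
        for _ in range(n_draft):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model scores the whole run in a single forward pass.
        #    checks[i] is its greedy token after the first i+1 tokens.
        checks = target_argmax(tokens + proposal)
        # 3) Accept the longest prefix where draft and target agree...
        n_ok = 0
        while n_ok < len(proposal) and proposal[n_ok] == checks[base + n_ok - 1]:
            n_ok += 1
        tokens += proposal[:n_ok]
        # ...then take the target's own token at the first mismatch
        # (or one bonus token if the whole draft was accepted).
        tokens.append(checks[base + n_ok - 1])
    return tokens[: len(prompt) + max_new]
```

With greedy decoding on both sides the output matches running the large model alone; the speedup comes from replacing most of its sequential forward passes with the small model's.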
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv 2023 [Paper] [Github]
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, arXiv 2024 [Paper] [Github]
- EdgeShard: Efficient LLM Inference via Collaborative Edge Computing, arXiv 2024 [Paper]
- LLMCad: Fast and Scalable On-device Large Language Model Inference, arXiv 2023 [Paper]
- The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro 2024 [Paper]
- MELTing Point: Mobile Evaluation of Language Transformers, arXiv 2024 [Paper] [Github]
- LLM as a System Service on Mobile Devices, arXiv 2024 [Paper]
- LocMoE: A Low-overhead MoE for Large Language Model Training, arXiv 2024 [Paper]
- EdgeMoE: Fast On-device Inference of MoE-based Large Language Models, arXiv 2023 [Paper]
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, arXiv 2024 [Paper] [Github]
- On the Viability of Using LLMs for SW/HW Co-Design: An Example in Designing CiM DNN Accelerators, IEEE SOCC 2023 [Paper]
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits, arXiv 2024 [Paper]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, arXiv 2023 [Paper] [Github]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, ICLR 2023 [Paper] [Github] (a round-to-nearest baseline is sketched after this list)
- GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022 [Paper]
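For intuition about what these methods improve on, here is a minimal NumPy sketch of the round-to-nearest, group-wise 4-bit baseline. GPTQ refines the rounding with second-order error compensation and AWQ rescales salient channels using activation statistics; neither refinement is shown here.

```python
import numpy as np

def quantize_rtn(W: np.ndarray, bits: int = 4, group: int = 128):
    """Round-to-nearest, symmetric, group-wise weight quantization.
    Each contiguous group of `group` weights in a row shares one scale.
    Assumes the column count is divisible by `group`."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit signed
    rows, cols = W.shape
    Wg = W.reshape(rows, cols // group, group)
    scale = np.abs(Wg).max(axis=-1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                # guard all-zero groups
    q = np.clip(np.round(Wg / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(q.shape[0], -1).astype(np.float32)

W = np.random.randn(8, 256).astype(np.float32)
q, s = quantize_rtn(W)
print(f"mean abs error: {np.abs(W - dequantize(q, s)).mean():.4f}")
```

Weight-only 4-bit storage like this cuts weight memory roughly 4× versus fp16, which is often the difference between a model fitting on a phone or not.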
- Challenges and Applications of Large Language Models, arXiv 2023 [Paper]
- MiniLLM: Knowledge Distillation of Large Language Models, ICLR 2024 [Paper]
- Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, AAAI 2024 [Paper]
- Matrix Compression via Randomized Low Rank and Low Precision Factorization, NeurIPS 2023 [Paper] [Github] (a plain truncated-SVD baseline is sketched below)
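The common starting point behind these low-rank methods fits in a few lines: replace an m×n weight matrix W with its rank-r truncated SVD, cutting storage from m·n to r·(m+n) values. The sketch below shows only this plain baseline, not the randomized sketching or low-precision factors of the cited papers.

```python
import numpy as np

def low_rank_compress(W: np.ndarray, rank: int):
    """Best rank-`rank` approximation of W (Eckart-Young) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r) factor with singular values folded in
    B = Vt[:rank]                # (r, n) factor
    return A, B

m, n, r = 1024, 4096, 64
W = np.random.randn(m, n).astype(np.float32)
A, B = low_rank_compress(W, r)
print(f"storage ratio: {(A.size + B.size) / W.size:.2%}")  # r*(m+n)/(m*n), ~7.8%
# At inference, y = (x @ A) @ B approximates y = x @ W,
# also cutting matmul cost from m*n to r*(m+n) per input row.
```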
- llama.cpp: A lightweight library for efficient LLM inference on various hardware with minimal setup (see the usage sketch after this list). [Github]
- MNN: A blazing fast, lightweight deep learning framework. [Github]
- PowerInfer: A CPU/GPU LLM inference engine that exploits activation locality for fast on-device inference. [Github]
- ExecuTorch: PyTorch's platform for on-device AI across mobile, embedded, and edge devices. [Github]
- MediaPipe: A suite of tools and libraries that enables quick application of AI and ML techniques. [Github]
- MLC-LLM: A machine learning compiler and high-performance deployment engine for large language models. [Github]
- vLLM: A fast and easy-to-use library for LLM inference and serving. [Github]
- OpenLLM: An open platform for operating large language models (LLMs) in production. [Github]
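As a concrete starting point, here is a minimal sketch using llama-cpp-python, the Python bindings for llama.cpp, to run a quantized GGUF checkpoint locally. The model path is a placeholder, and argument names may vary slightly between versions.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Placeholder path: any 4-bit GGUF checkpoint downloaded locally.
llm = Llama(
    model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf",
    n_ctx=2048,    # context window in tokens
    n_threads=4,   # CPU threads; tune for the target device
)

out = llm(
    "Q: Why run an LLM on-device instead of in the cloud? A:",
    max_tokens=64,
    stop=["Q:"],   # stop before the model invents the next question
)
print(out["choices"][0]["text"])
```

The same GGUF file can also be run directly from the llama.cpp command-line tools; the Python bindings are simply convenient for experimentation.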
- The Breakthrough Memory Solutions for Improved Performance on LLM Inference, IEEE Micro 2024 [Paper]
- Aquabolt-XL: Samsung HBM2-PIM with In-memory Processing for ML Accelerators and Beyond, IEEE Hot Chips 2021 [Paper]
- Text generation for messaging: Gboard Smart Reply
- Translation: LLMCad
- Meeting summarization
- Healthcare applications: BioMistral-7B, HuatuoGPT
- Research support
- Companion robots
- Disability support: Octopus v3, TalkBack with Gemini Nano
- Autonomous vehicles: DriveVLM
Model | Institute | Paper |
---|---|---|
Gemini Nano | Google | Gemini: A Family of Highly Capable Multimodal Models |
Octopus series model | Nexa AI | Octopus v2: On-device language model for super agent; Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent; Octopus v4: Graph of language models; Octopus: On-device language model for function calling of software APIs |
OpenELM and Ferret-v2 | Apple | OpenELM is a significant large language model integrated within iOS to enhance application functionalities. Ferret-v2 significantly improves upon its predecessor, introducing enhanced visual processing capabilities and an advanced training regimen. |
Phi series | Microsoft | Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
MiniCPM | Tsinghua University | A GPT-4V Level Multimodal LLM on Your Phone |
Gemma2-9B | Google | Gemma 2: Improving Open Language Models at a Practical Size |
Qwen2-0.5B | Alibaba Group | Qwen Technical Report |
- MIT: TinyML and Efficient Deep Learning Computing
- Harvard: Machine Learning Systems
- DeepLearning.AI: Introduction to On-Device AI
We believe in the power of community! If you're passionate about on-device AI and want to contribute to this ever-growing knowledge hub, here's how you can get involved:
- Fork the repository
- Create a new branch for your brilliant additions
- Make your updates and push your changes
- Submit a pull request and become part of the on-device LLM movement
If our hub fuels your research or powers your projects, we'd be thrilled if you could cite our paper here:
@article{xu2024device,
title={On-Device Language Models: A Comprehensive Review},
author={Xu, Jiajun and Li, Zhiyuan and Chen, Wei and Wang, Qun and Gao, Xin and Cai, Qi and Ling, Ziyuan},
journal={arXiv preprint arXiv:2409.00088},
year={2024}
}
This project is open-source and available under the MIT License. See the LICENSE file for more details.
Don't just read about the future of AI – be part of it. Star this repo, spread the word, and let's push the boundaries of on-device LLMs together! 🚀🌟