Native-LLM-for-Android

Demonstration of running a native LLM on an Android device.


This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.

README:

Native-LLM-for-Android

Overview

Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:

  • Qwen2.5-Instruct: 0.5B, 1.5B
  • Qwen2VL: 2B
  • MiniCPM-DPO/SFT: 1B, 2.7B
  • Gemma2-it: 2B
  • Phi3.5-mini-instruct: 3.8B
  • Llama-3.2-Instruct: 1B

Getting Started

  1. Download Models:

  2. Setup Instructions:

    • Place the downloaded model files into the assets folder.
    • Decompress the *.so files stored in the libs/arm64-v8a folder.
  3. Model Notes:

    • Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
    • Inputs and outputs may differ slightly from the original models.
  4. ONNX Export Considerations:

    • Dynamic axes were not used during export so the models better suit ONNX Runtime on Android; the exported ONNX models may therefore not be optimal for x86_64 systems (see the sketch after this list).
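
Because the exports use fixed shapes, inputs at run time must match the shapes baked in at export time. Below is a minimal sketch of running such a static-shape model with ONNX Runtime in Python; the file name, tensor names, and the 1024-token context length are illustrative assumptions, not the repo's actual values:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file and tensor names; the repo's exports define their own.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# With a static-shape export, the input must match the shape fixed at
# export time (assumed here: batch of 1, 1024-token context).
input_ids = np.ones((1, 1024), dtype=np.int64)
logits = session.run(["logits"], {"input_ids": input_ids})[0]
print(logits.shape)
```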

Tokenizer Files

Exporting Models

  1. Navigate to the Export_ONNX folder.
  2. Follow the comments in the Python scripts to set the folder paths.
  3. Execute the ***_Export.py script to export the model (a minimal export sketch follows this list).
  4. Quantize or optimize the ONNX model manually.
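
As a rough illustration of step 3, the sketch below exports a HuggingFace causal LM to ONNX with fixed input shapes (no dynamic_axes), in line with the export notes above. The model ID, wrapper class, shapes, and opset version are illustrative assumptions; the repo's ***_Export.py scripts are the authoritative reference:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical model ID; the repo's scripts set their own paths.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model.eval()

class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so the ONNX graph has a single tensor output."""
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, input_ids):
        return self.m(input_ids=input_ids, use_cache=False).logits

# Fixed shapes (no dynamic_axes), matching the static-shape export notes above.
dummy_ids = torch.ones((1, 1024), dtype=torch.int64)
torch.onnx.export(
    LogitsOnly(model), (dummy_ids,), "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    opset_version=17,
)
```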

Quantization Notes

  • Use onnxruntime.tools.convert_onnx_models_to_ort to convert models to *.ort format; note that this process automatically adds Cast operators that change FP16 multiplication to FP32 (see the example after this list).
  • The quantization methods are detailed in the Do_Quantize folder.
  • The q4 (uint4) quantization method is not recommended due to poor performance of the MatMulNBits operator in ONNX Runtime.
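
As a hedged illustration only (the scripts in the Do_Quantize folder are the authoritative methods), the sketch below applies 8-bit dynamic quantization with onnxruntime's standard API and then invokes the *.ort converter mentioned above; the file names are placeholders:

```python
import subprocess
from onnxruntime.quantization import QuantType, quantize_dynamic

# One possible 8-bit weight quantization; the repo's Do_Quantize scripts
# may use a different method or configuration.
quantize_dynamic(
    model_input="model.onnx",        # placeholder file name
    model_output="model_q8.onnx",
    weight_type=QuantType.QUInt8,
)

# Convert to *.ort for ONNX Runtime on Android. As noted above, this step
# can insert Cast operators that turn FP16 multiplication into FP32.
subprocess.run(
    ["python", "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "model_q8.onnx"],
    check=True,
)
```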

Recent Updates

  • 2024/11/04: Added support for Qwen2VL-2B (Vision LLM).

Additional Resources

Performance Metrics

Qwen2VL

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2VL-2B (q8f32) | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2VL-2B (q8f32) | 9 token/s |

Qwen

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2-1.5B-Instruct (q8f32) | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2-1.5B-Instruct (q8f32) | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | Qwen2-1.5B-Instruct (q8f32) | 7 token/s |

MiniCPM

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-2.7B (q8f32) | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-2.7B (q8f32) | 6 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-1.3B (q8f32) | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-1.3B (q8f32) | 11 token/s |

Yuan

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Yuan2.0-2B-Mars-hf (q8f32) | 12 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Yuan2.0-2B-Mars-hf (q8f32) | 6.5 token/s |

Gemma

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Gemma1.1-it-2B (q8f32) | 16 token/s |

StableLM

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | StableLM2-1.6B-Chat (q8f32) | 17.8 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | StableLM2-1.6B-Chat (q8f32) | 11 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | StableLM2-1.6B-Chat (q8f32) | 5.5 token/s |

Phi

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Phi2-2B-Orange-V2 (q8f32) | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Phi2-2B-Orange-V2 (q8f32) | 5.8 token/s |

Llama

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Llama3.2-1B-Instruct (q8f32) | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Llama3.2-1B-Instruct (q8f32) | 16 token/s |

Demo Results

Qwen2VL-2B / 1024 Context

Demo Animation

Qwen2-1.5B / 1024 Context

Demo Animation

