
Native-LLM-for-Android
Demonstration of running a native LLM on Android device.
Stars: 120

This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.
README:
Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:
- DeepSeek-R1-Distill-Qwen: 1.5B
- Qwen2.5-Instruct: 0.5B, 1.5B
- Qwen2/2.5VL: 2B, 3B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma2-it: 2B
- Phi3.5-mini-instruct: 3.8B
- Llama-3.2-Instruct: 1B
-
Download Models:
- Demo models are available on Google Drive.
- Alternatively, use Baidu Cloud with the extraction code:
dake
. - Quick Try: DeepSeek-R1-Distill-Qwen-1.5B-Android / Qwen2VL-2B
-
Setup Instructions:
- Place the downloaded model files into the
assets
folder. - Decompress the
*.so
files stored in thelibs/arm64-v8a
folder.
- Place the downloaded model files into the
-
Model Notes:
- Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
- Inputs and outputs may differ slightly from the original models.
- For Qwen2VL / Qwen2.5VL, adjust the key variables to match the model parameters and
export_config.py
.GLRender.java: Line 37, 38, 39
project.h: Line 14, 15, 16, 35, 36, 39, 116, 117, 118, 121, 122
-
ONNX Export Considerations:
- Dynamic axes were not used during export to better adapt to ONNX Runtime on Android. Exported ONNX models may not be optimal for x86_64 systems.
- The
tokenizer.cpp
andtokenizer.hpp
files are sourced from the mnn-llm repository.
- Navigate to the
Export_ONNX
folder. - Follow the comments in the Python scripts to set the folder paths.
- Execute the
***_Export.py
script to export the model. - Quantize or optimize the ONNX model manually.
- Use
onnxruntime.tools.convert_onnx_models_to_ort
to convert models to*.ort
format. Note that this process automatically addsCast
operators that change FP16 multiplication to FP32. - The quantization methods are detailed in the
Do_Quantize
folder. - The
q4
(uint4) quantization method is not recommended due to poor performance of theMatMulNBits
operator in ONNX Runtime.
- 2025/02/07:DeepSeek-R1-Distill-Qwen: 1.5B (Please using Qwen_Export.py)
- 2025/02/22:Support loading with low memory mode:
Qwen
,QwenVL
,Phi_single
,MiniCPM_2B_single
; Setlow_memory_mode = true
inMainActivity.java
.
- Explore more projects: DakeQQ Projects
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Harmony 4 | P40 | Kirin_990_5G-CPU | Distill-Qwen-1.5B q8f32 |
13 token/s |
HyperOS 2 | Xiaomi-14T-Pro | MediaTek_9300+-CPU | Distill-Qwen-1.5B q8f32 |
22 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen2VL-2B q8f32 |
15 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen2VL-2B q8f32 |
9 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Qwen2-1.5B-Instruct q8f32 |
20 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Qwen2-1.5B-Instruct q8f32 |
13 token/s |
Harmony 3 | 荣耀 20S | Kirin_810-CPU | Qwen2-1.5B-Instruct q8f32 |
7 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-2.7B q8f32 |
9.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-2.7B q8f32 |
6 token/s |
Android 13 | Nubia Z50 | 8_Gen2-CPU | MiniCPM-1.3B q8f32 |
16.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | MiniCPM-1.3B q8f32 |
11 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Gemma1.1-it-2B q8f32 |
16 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Phi2-2B-Orange-V2 q8f32 |
9.5 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Phi2-2B-Orange-V2 q8f32 |
5.8 token/s |
OS | Device | Backend | Model | Inference (1024 Context) |
---|---|---|---|---|
Android 13 | Nubia Z50 | 8_Gen2-CPU | Llama3.2-1B-Instruct q8f32 |
25 token/s |
Harmony 4 | P40 | Kirin_990_5G-CPU | Llama3.2-1B-Instruct q8f32 |
16 token/s |
展示在 Android 设备上运行原生大型语言模型 (LLM) 的示范。目前支持的模型包括:
- DeepSeek-R1-Distill-Qwen: 1.5B
- Qwen2.5-Instruct: 0.5B, 1.5B
- Qwen2/2.5VL: 2B, 3B
- MiniCPM-DPO/SFT: 1B, 2.7B
- Gemma2-it: 2B
- Phi3.5-mini-instruct: 3.8B
- Llama-3.2-Instruct: 1B
-
下载模型:
- Demo模型可以在 Google Drive 上获取。
- 或者使用 百度网盘 提取码:
dake
。 - Quick Try: DeepSeek-R1-Distill-Qwen-1.5B-Android / Qwen2VL-2B
-
设置说明:
- 将下载的模型文件放入
assets
文件夹。 - 解压存储在
libs/arm64-v8a
文件夹中的*.so
文件。
- 将下载的模型文件放入
-
模型说明:
- 演示模型是从 HuggingFace 或 ModelScope 转换而来,并针对极限执行速度进行了优化。
- 输入和输出可能与原始模型略有不同。
- 对于Qwen2VL / Qwen2.5VL,请调整关键变量以匹配模型参数和
export_config.py
GLRender.java: Line 37, 38, 39
project.h: Line 14, 15, 16, 35, 36, 39, 116, 117, 118, 121, 122
-
ONNX 导出注意事项:
- 导出时未使用动态轴,以更好地适应 Android 上的 ONNX Runtime。导出的 ONNX 模型可能不适合 x86_64 系统。
-
tokenizer.cpp
和tokenizer.hpp
文件来源于 mnn-llm 仓库。
- 进入
Export_ONNX
文件夹。 - 按照 Python 脚本中的注释设置文件夹路径。
- 执行
***_Export.py
脚本以导出模型。 - 手动量化或优化 ONNX 模型。
- 使用
onnxruntime.tools.convert_onnx_models_to_ort
将模型转换为*.ort
格式。注意该过程会自动添加Cast
操作符,将 FP16 乘法改为 FP32。 - 量化方法详见
Do_Quantize
文件夹。 - 不推荐使用
q4
(uint4) 量化方法,因为 ONNX Runtime 中MatMulNBits
操作符性能较差。
- 2025/02/07:DeepSeek-R1-Distill-Qwen: 1.5B (请使用Qwen_Export.py)
- 2025/02/22:支持低内存模式加载:
Qwen
,QwenVL
,Phi_single
,MiniCPM_2B_single
; Setlow_memory_mode = true
inMainActivity.java
.
- 探索更多项目:DakeQQ Projects
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for Native-LLM-for-Android
Similar Open Source Tools

Native-LLM-for-Android
This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.

Qwen-TensorRT-LLM
Qwen-TensorRT-LLM is a project developed for the NVIDIA TensorRT Hackathon 2023, focusing on accelerating inference for the Qwen-7B-Chat model using TRT-LLM. The project offers various functionalities such as FP16/BF16 support, INT8 and INT4 quantization options, Tensor Parallel for multi-GPU parallelism, web demo setup with gradio, Triton API deployment for maximum throughput/concurrency, fastapi integration for openai requests, CLI interaction, and langchain support. It supports models like qwen2, qwen, and qwen-vl for both base and chat models. The project also provides tutorials on Bilibili and blogs for adapting Qwen models in NVIDIA TensorRT-LLM, along with hardware requirements and quick start guides for different model types and quantization methods.

spiceai
Spice is a portable runtime written in Rust that offers developers a unified SQL interface to materialize, accelerate, and query data from any database, data warehouse, or data lake. It connects, fuses, and delivers data to applications, machine-learning models, and AI-backends, functioning as an application-specific, tier-optimized Database CDN. Built with industry-leading technologies such as Apache DataFusion, Apache Arrow, Apache Arrow Flight, SQLite, and DuckDB. Spice makes it fast and easy to query data from one or more sources using SQL, co-locating a managed dataset with applications or machine learning models, and accelerating it with Arrow in-memory, SQLite/DuckDB, or attached PostgreSQL for fast, high-concurrency, low-latency queries.

EVE
EVE is an official PyTorch implementation of Unveiling Encoder-Free Vision-Language Models. The project aims to explore the removal of vision encoders from Vision-Language Models (VLMs) and transfer LLMs to encoder-free VLMs efficiently. It also focuses on bridging the performance gap between encoder-free and encoder-based VLMs. EVE offers a superior capability with arbitrary image aspect ratio, data efficiency by utilizing publicly available data for pre-training, and training efficiency with a transparent and practical strategy for developing a pure decoder-only architecture across modalities.

nexa-sdk
Nexa SDK is a comprehensive toolkit supporting ONNX and GGML models for text generation, image generation, vision-language models (VLM), and text-to-speech (TTS) capabilities. It offers an OpenAI-compatible API server with JSON schema mode and streaming support, along with a user-friendly Streamlit UI. Users can run Nexa SDK on any device with Python environment, with GPU acceleration supported. The toolkit provides model support, conversion engine, inference engine for various tasks, and differentiating features from other tools.

vlmrun-cookbook
VLM Run Cookbook is a repository containing practical examples and tutorials for extracting structured data from images, videos, and documents using Vision Language Models (VLMs). It offers comprehensive Colab notebooks demonstrating real-world applications of VLM Run, with complete code and documentation for easy adaptation. The examples cover various domains such as financial documents and TV news analysis.

ipex-llm
IPEX-LLM is a PyTorch library for running Large Language Models (LLMs) on Intel CPUs and GPUs with very low latency. It provides seamless integration with various LLM frameworks and tools, including llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, and more. IPEX-LLM has been optimized and verified on over 50 LLM models, including LLaMA, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, and RWKV. It supports a range of low-bit inference formats, including INT4, FP8, FP4, INT8, INT2, FP16, and BF16, as well as finetuning capabilities for LoRA, QLoRA, DPO, QA-LoRA, and ReLoRA. IPEX-LLM is actively maintained and updated with new features and optimizations, making it a valuable tool for researchers, developers, and anyone interested in exploring and utilizing LLMs.

DownEdit
DownEdit is a fast and powerful program for downloading and editing videos from platforms like TikTok, Douyin, and Kuaishou. It allows users to effortlessly grab videos, make bulk edits, and utilize advanced AI features for generating videos, images, and sounds in bulk. The tool offers features like video, photo, and sound editing, downloading videos without watermarks, bulk AI generation, and AI editing for content enhancement.

DownEdit
DownEdit is a powerful program that allows you to download videos from various social media platforms such as TikTok, Douyin, Kuaishou, and more. With DownEdit, you can easily download videos from user profiles and edit them in bulk. You have the option to flip the videos horizontally or vertically throughout the entire directory with just a single click. Stay tuned for more exciting features coming soon!

rulm
This repository contains language models for the Russian language, as well as their implementation and comparison. The models are trained on a dataset of ChatGPT-generated instructions and chats in Russian. They can be used for a variety of tasks, including question answering, text generation, and translation.

west
WeST is a Speech Recognition/Transcript tool developed in 300 lines of code, inspired by SLAM-ASR and LLaMA 3.1. The model includes a Language Model (LLM), a Speech Encoder, and a trainable Projector. It requires training data in jsonl format with 'wav' and 'txt' entries. WeST can be used for training and decoding speech recognition models.

skpro
skpro is a library for supervised probabilistic prediction in python. It provides `scikit-learn`-like, `scikit-base` compatible interfaces to: * tabular **supervised regressors for probabilistic prediction** \- interval, quantile and distribution predictions * tabular **probabilistic time-to-event and survival prediction** \- instance-individual survival distributions * **metrics to evaluate probabilistic predictions** , e.g., pinball loss, empirical coverage, CRPS, survival losses * **reductions** to turn `scikit-learn` regressors into probabilistic `skpro` regressors, such as bootstrap or conformal * building **pipelines and composite models** , including tuning via probabilistic performance metrics * symbolic **probability distributions** with value domain of `pandas.DataFrame`-s and `pandas`-like interface

cursor-free-vip
Cursor Free VIP is a tool designed to automatically bypass Cursor's membership check, upgrade to 'pro' membership, support Windows and macOS systems, send Token requests in real-time, and reset Cursor's configuration. It provides a seamless experience for users to access premium features without the need for manual upgrades or configuration changes. The tool aims to simplify the process of accessing advanced functionalities offered by Cursor, enhancing user experience and productivity.

build_MiniLLM_from_scratch
This repository aims to build a low-parameter LLM model through pretraining, fine-tuning, model rewarding, and reinforcement learning stages to create a chat model capable of simple conversation tasks. It features using the bert4torch training framework, seamless integration with transformers package for inference, optimized file reading during training to reduce memory usage, providing complete training logs for reproducibility, and the ability to customize robot attributes. The chat model supports multi-turn conversations. The trained model currently only supports basic chat functionality due to limitations in corpus size, model scale, SFT corpus size, and quality.

Muice-Chatbot
Muice-Chatbot is an AI chatbot designed to proactively engage in conversations with users. It is based on the ChatGLM2-6B and Qwen-7B models, with a training dataset of 1.8K+ dialogues. The chatbot has a speaking style similar to a 2D girl, being somewhat tsundere but willing to share daily life details and greet users differently every day. It provides various functionalities, including initiating chats and offering 5 available commands. The project supports model loading through different methods and provides onebot service support for QQ users. Users can interact with the chatbot by running the main.py file in the project directory.

XianyuAutoAgent
Xianyu AutoAgent is an AI customer service robot system specifically designed for the Xianyu platform, providing 24/7 automated customer service, supporting multi-expert collaborative decision-making, intelligent bargaining, and context-aware conversations. The system includes intelligent conversation engine with features like context awareness and expert routing, business function matrix with modules like core engine, bargaining system, technical support, and operation monitoring. It requires Python 3.8+ and NodeJS 18+ for installation and operation. Users can customize prompts for different experts and contribute to the project through issues or pull requests.
For similar tasks

lighteval
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron. We're releasing it with the community in the spirit of building in the open. Note that it is still very much early so don't expect 100% stability ^^' In case of problems or question, feel free to open an issue!

Firefly
Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.

Awesome-Text2SQL
Awesome Text2SQL is a curated repository containing tutorials and resources for Large Language Models, Text2SQL, Text2DSL, Text2API, Text2Vis, and more. It provides guidelines on converting natural language questions into structured SQL queries, with a focus on NL2SQL. The repository includes information on various models, datasets, evaluation metrics, fine-tuning methods, libraries, and practice projects related to Text2SQL. It serves as a comprehensive resource for individuals interested in working with Text2SQL and related technologies.

create-million-parameter-llm-from-scratch
The 'create-million-parameter-llm-from-scratch' repository provides a detailed guide on creating a Large Language Model (LLM) with 2.3 million parameters from scratch. The blog replicates the LLaMA approach, incorporating concepts like RMSNorm for pre-normalization, SwiGLU activation function, and Rotary Embeddings. The model is trained on a basic dataset to demonstrate the ease of creating a million-parameter LLM without the need for a high-end GPU.

StableToolBench
StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features such as a Virtual API System with caching and API simulators, a new set of solvable queries determined by LLMs, and a Stable Evaluation System using GPT-4. The Virtual API Server can be set up either by building from source or using a prebuilt Docker image. Users can test the server using provided scripts and evaluate models with Solvable Pass Rate and Solvable Win Rate metrics. The tool also includes model experiments results comparing different models' performance.

BetaML.jl
The Beta Machine Learning Toolkit is a package containing various algorithms and utilities for implementing machine learning workflows in multiple languages, including Julia, Python, and R. It offers a range of supervised and unsupervised models, data transformers, and assessment tools. The models are implemented entirely in Julia and are not wrappers for third-party models. Users can easily contribute new models or request implementations. The focus is on user-friendliness rather than computational efficiency, making it suitable for educational and research purposes.

AI-TOD
AI-TOD is a dataset for tiny object detection in aerial images, containing 700,621 object instances across 28,036 images. Objects in AI-TOD are smaller with a mean size of 12.8 pixels compared to other aerial image datasets. To use AI-TOD, download xView training set and AI-TOD_wo_xview, then generate the complete dataset using the provided synthesis tool. The dataset is publicly available for academic and research purposes under CC BY-NC-SA 4.0 license.

UMOE-Scaling-Unified-Multimodal-LLMs
Uni-MoE is a MoE-based unified multimodal model that can handle diverse modalities including audio, speech, image, text, and video. The project focuses on scaling Unified Multimodal LLMs with a Mixture of Experts framework. It offers enhanced functionality for training across multiple nodes and GPUs, as well as parallel processing at both the expert and modality levels. The model architecture involves three training stages: building connectors for multimodal understanding, developing modality-specific experts, and incorporating multiple trained experts into LLMs using the LoRA technique on mixed multimodal data. The tool provides instructions for installation, weights organization, inference, training, and evaluation on various datasets.
For similar jobs

react-native-vision-camera
VisionCamera is a powerful, high-performance Camera library for React Native. It features Photo and Video capture, QR/Barcode scanner, Customizable devices and multi-cameras ("fish-eye" zoom), Customizable resolutions and aspect-ratios (4k/8k images), Customizable FPS (30..240 FPS), Frame Processors (JS worklets to run facial recognition, AI object detection, realtime video chats, ...), Smooth zooming (Reanimated), Fast pause and resume, HDR & Night modes, Custom C++/GPU accelerated video pipeline (OpenGL).

iris_android
This repository contains an offline Android chat application based on llama.cpp example. Users can install, download models, and run the app completely offline and privately. To use the app, users need to go to the releases page, download and install the app. Building the app requires downloading Android Studio, cloning the repository, and importing it into Android Studio. The app can be run offline by following specific steps such as enabling developer options, wireless debugging, and downloading the stable LM model. The project is maintained by Nerve Sparks and contributions are welcome through creating feature branches and pull requests.

aiolauncher_scripts
AIO Launcher Scripts is a collection of Lua scripts that can be used with AIO Launcher to enhance its functionality. These scripts can be used to create widget scripts, search scripts, and side menu scripts. They provide various functions such as displaying text, buttons, progress bars, charts, and interacting with app widgets. The scripts can be used to customize the appearance and behavior of the launcher, add new features, and interact with external services.

gemini-android
Gemini Android is a repository showcasing Google's Generative AI on Android using Stream Chat SDK for Compose. It demonstrates the Gemini API for Android, implements UI elements with Jetpack Compose, utilizes Android architecture components like Hilt and AppStartup, performs background tasks with Kotlin Coroutines, and integrates chat systems with Stream Chat Compose SDK for real-time event handling. The project also provides technical content, instructions on building the project, tech stack details, architecture overview, modularization strategies, and a contribution guideline. It follows Google's official architecture guidance and offers a real-world example of app architecture implementation.

react-native-airship
React Native Airship is a module designed to integrate Airship's iOS and Android SDKs into React Native applications. It provides developers with the necessary tools to incorporate Airship's push notification services seamlessly. The module offers a simple and efficient way to leverage Airship's features within React Native projects, enhancing user engagement and retention through targeted notifications.

gpt_mobile
GPT Mobile is a chat assistant for Android that allows users to chat with multiple models at once. It supports various platforms such as OpenAI GPT, Anthropic Claude, and Google Gemini. Users can customize temperature, top p (Nucleus sampling), and system prompt. The app features local chat history, Material You style UI, dark mode support, and per app language setting for Android 13+. It is built using 100% Kotlin, Jetpack Compose, and follows a modern app architecture for Android developers.

Native-LLM-for-Android
This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.

AIDE-Plus
AIDE-Plus is a comprehensive tool for Android app development, offering support for various Java syntax versions, Gradle and Maven build systems, ProGuard, AndroidX, CMake builds, APK/AAB generation, code coloring customization, data binding, and APK signing. It also provides features like AAPT2, D8, runtimeOnly, compileOnly, libgdxNatives, manifest merging, Shizuku installation support, and syntax auto-completion. The tool aims to streamline the development process and enhance the user experience by addressing common issues and providing advanced functionalities.