Native-LLM-for-Android

Demonstration of running a native LLM on an Android device.


This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.

README:

Native-LLM-for-Android

Overview

Demonstration of running a native Large Language Model (LLM) on Android devices. Currently supported models include:

  • Qwen2.5-Instruct: 0.5B, 1.5B
  • Qwen2VL: 2B
  • MiniCPM-DPO/SFT: 1B, 2.7B
  • Gemma2-it: 2B
  • Phi3.5-mini-instruct: 3.8B
  • Llama-3.2-Instruct: 1B

Getting Started

  1. Download Models:

  2. Setup Instructions:

    • Place the downloaded model files into the assets folder.
    • Decompress the *.so files stored in the libs/arm64-v8a folder.
  3. Model Notes:

    • Demo models are converted from HuggingFace or ModelScope and optimized for extreme execution speed.
    • Inputs and outputs may differ slightly from the original models.
  4. ONNX Export Considerations:

    • Dynamic axes were not used during export so the models better suit ONNX Runtime on Android; the exported ONNX models may therefore not be optimal for x86_64 systems (see the sketch after this list).
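
Because the exports use fixed shapes, inputs at run time must match the shapes baked in at export time. Below is a minimal sketch of running such a static-shape model with ONNX Runtime in Python; the file name, tensor names, and the 1024-token context length are illustrative assumptions, not the repo's actual values:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file and tensor names; the repo's exports define their own.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# With a static-shape export, the input must match the shape fixed at
# export time (assumed here: batch of 1, 1024-token context).
input_ids = np.ones((1, 1024), dtype=np.int64)
logits = session.run(["logits"], {"input_ids": input_ids})[0]
print(logits.shape)
```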

Tokenizer Files

Exporting Models

  1. Navigate to the Export_ONNX folder.
  2. Follow the comments in the Python scripts to set the folder paths.
  3. Execute the ***_Export.py script to export the model (a minimal export sketch follows this list).
  4. Quantize or optimize the ONNX model manually.
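
As a rough illustration of step 3, the sketch below exports a HuggingFace causal LM to ONNX with fixed input shapes (no dynamic_axes), in line with the export notes above. The model ID, wrapper class, shapes, and opset version are illustrative assumptions; the repo's ***_Export.py scripts are the authoritative reference:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical model ID; the repo's scripts set their own paths.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model.eval()

class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so the ONNX graph has a single tensor output."""
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, input_ids):
        return self.m(input_ids=input_ids, use_cache=False).logits

# Fixed shapes (no dynamic_axes), matching the static-shape export notes above.
dummy_ids = torch.ones((1, 1024), dtype=torch.int64)
torch.onnx.export(
    LogitsOnly(model), (dummy_ids,), "model.onnx",
    input_names=["input_ids"], output_names=["logits"],
    opset_version=17,
)
```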

Quantization Notes

  • Use onnxruntime.tools.convert_onnx_models_to_ort to convert models to *.ort format; note that this process automatically adds Cast operators that change FP16 multiplication to FP32 (see the example after this list).
  • The quantization methods are detailed in the Do_Quantize folder.
  • The q4 (uint4) quantization method is not recommended due to poor performance of the MatMulNBits operator in ONNX Runtime.
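
As a hedged illustration only (the scripts in the Do_Quantize folder are the authoritative methods), the sketch below applies 8-bit dynamic quantization with onnxruntime's standard API and then invokes the *.ort converter mentioned above; the file names are placeholders:

```python
import subprocess
from onnxruntime.quantization import QuantType, quantize_dynamic

# One possible 8-bit weight quantization; the repo's Do_Quantize scripts
# may use a different method or configuration.
quantize_dynamic(
    model_input="model.onnx",        # placeholder file name
    model_output="model_q8.onnx",
    weight_type=QuantType.QUInt8,
)

# Convert to *.ort for ONNX Runtime on Android. As noted above, this step
# can insert Cast operators that turn FP16 multiplication into FP32.
subprocess.run(
    ["python", "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "model_q8.onnx"],
    check=True,
)
```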

Recent Updates

  • 2024/11/04: Added support for Qwen2VL-2B (Vision LLM).

Additional Resources

Performance Metrics

Qwen2VL

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2VL-2B (q8f32) | 15 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2VL-2B (q8f32) | 9 token/s |

Qwen

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Qwen2-1.5B-Instruct (q8f32) | 20 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Qwen2-1.5B-Instruct (q8f32) | 13 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | Qwen2-1.5B-Instruct (q8f32) | 7 token/s |

MiniCPM

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-2.7B (q8f32) | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-2.7B (q8f32) | 6 token/s |
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | MiniCPM-1.3B (q8f32) | 16.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | MiniCPM-1.3B (q8f32) | 11 token/s |

Yuan

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Yuan2.0-2B-Mars-hf (q8f32) | 12 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Yuan2.0-2B-Mars-hf (q8f32) | 6.5 token/s |

Gemma

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Gemma1.1-it-2B (q8f32) | 16 token/s |

StableLM

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | StableLM2-1.6B-Chat (q8f32) | 17.8 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | StableLM2-1.6B-Chat (q8f32) | 11 token/s |
| Harmony 3 | Honor 20S | Kirin_810-CPU (2*A76) | StableLM2-1.6B-Chat (q8f32) | 5.5 token/s |

Phi

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Phi2-2B-Orange-V2 (q8f32) | 9.5 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Phi2-2B-Orange-V2 (q8f32) | 5.8 token/s |

Llama

| OS | Device | Backend | Model | Inference (1024-token context) |
|----|--------|---------|-------|--------------------------------|
| Android 13 | Nubia Z50 | 8_Gen2-CPU (X3+A715) | Llama3.2-1B-Instruct (q8f32) | 25 token/s |
| Harmony 4 | P40 | Kirin_990_5G-CPU (2*A76) | Llama3.2-1B-Instruct (q8f32) | 16 token/s |

Demo Results

Qwen2VL-2B / 1024 Context

Demo Animation

Qwen2-1.5B / 1024 Context

Demo Animation

