nexa-sdk

nexa-sdk

Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), Audio Language Model, auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities.

Stars: 5474

Visit
 screenshot

Nexa SDK is a comprehensive toolkit supporting ONNX and GGML models for text generation, image generation, vision-language models (VLM), and text-to-speech (TTS) capabilities. It offers an OpenAI-compatible API server with JSON schema mode and streaming support, along with a user-friendly Streamlit UI. Users can run Nexa SDK on any device with Python environment, with GPU acceleration supported. The toolkit provides model support, conversion engine, inference engine for various tasks, and differentiating features from other tools.

README:

Nexa SDK - Local On-Device Inference Framework

MacOS Linux Windows Build workflow GitHub License GitHub Release PyPI

On-Device Model Hub | Documentation | Discord | Blogs | X (Twitter)

Nexa SDK is a local on-device inference framework for ONNX and GGML models, supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS) capabilities. Installable via Python Package or Executable Installer.

Features

  • Device Support: CPU, GPU (CUDA, Metal, ROCm), iOS
  • Server: OpenAI-compatible API, JSON schema for function calling and streaming support
  • Local UI: Streamlit for interactive model deployment and testing

Latest News đŸ”Ĩ

  • Support Nexa AI's own vision language model (0.9B parameters): nexa run omniVLM and audio language model (2.9B parameters): nexa run omniaudio
  • Support audio language model: nexa run qwen2audio, we are the first open-source toolkit to support audio language model with GGML tensor library.
  • Support iOS Swift binding for local inference on iOS mobile devices.
  • Support embedding model: nexa embed <model_path> <prompt>
  • Support pull and run supported Computer Vision models in GGUF format from HuggingFace or ModelScope: nexa run -hf <hf_model_id> -mt COMPUTER_VISION or nexa run -ms <ms_model_id> -mt COMPUTER_VISION
  • Support pull and run NLP models in GGUF format from HuggingFace or ModelScope: nexa run -hf <hf_model_id> -mt NLP or nexa run -ms <ms_model_id> -mt NLP

Welcome to submit your requests through issues, we ship weekly.

Install Option 1: Executable Installer

macOS Installer

Windows Installer

Linux Installer

curl -fsSL https://public-storage.nexa4ai.com/install.sh | sh
FAQ: cannot use executable with nexaai python package already installed

Try using nexa-exe instead:

nexa-exe <command>

Install Option 2: Python Package

We have released pre-built wheels for various Python versions, platforms, and backends for convenient installation on our index page.

CPU
pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cpu --extra-index-url https://pypi.org/simple --no-cache-dir
Apple GPU (Metal)

For the GPU version supporting Metal (macOS):

CMAKE_ARGS="-DGGML_METAL=ON -DSD_METAL=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir
FAQ: cannot use Metal/GPU on M1

Try the following command:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
bash Miniforge3-MacOSX-arm64.sh
conda create -n nexasdk python=3.10
conda activate nexasdk
CMAKE_ARGS="-DGGML_METAL=ON -DSD_METAL=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/metal --extra-index-url https://pypi.org/simple --no-cache-dir
Nvidia GPU (CUDA)

To install with CUDA support, make sure you have CUDA Toolkit 12.0 or later installed.

For Linux:

CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows PowerShell:

$env:CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON"; pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Command Prompt:

set CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" & pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Git Bash:

CMAKE_ARGS="-DGGML_CUDA=ON -DSD_CUBLAS=ON" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/cu124 --extra-index-url https://pypi.org/simple --no-cache-dir
FAQ: Building Issues for llava

If you encounter the following issue while building:

try the following command:

CMAKE_ARGS="-DCMAKE_CXX_FLAGS=-fopenmp" pip install nexaai
AMD GPU (ROCm)

To install with ROCm support, make sure you have ROCm 6.2.1 or later installed.

For Linux:

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/rocm621 --extra-index-url https://pypi.org/simple --no-cache-dir
GPU (Vulkan)

To install with Vulkan support, make sure you have Vulkan SDK 1.3.261.1 or later installed.

For Windows PowerShell:

$env:CMAKE_ARGS="-DGGML_VULKAN=on"; pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Command Prompt:

set CMAKE_ARGS="-DGGML_VULKAN=on" & pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir

For Windows Git Bash:

CMAKE_ARGS="-DGGML_VULKAN=on" pip install nexaai --prefer-binary --index-url https://github.nexa.ai/whl/vulkan --extra-index-url https://pypi.org/simple --no-cache-dir
Local Build

How to clone this repo

git clone --recursive https://github.com/NexaAI/nexa-sdk

If you forget to use --recursive, you can use below command to add submodule

git submodule update --init --recursive

Then you can build and install the package

pip install -e .

Differentiation

Below is our differentiation from other similar tools:

Feature Nexa SDK ollama Optimum LM Studio
GGML Support ✅ ✅ ❌ ✅
ONNX Support ✅ ❌ ✅ ❌
Text Generation ✅ ✅ ✅ ✅
Image Generation ✅ ❌ ❌ ❌
Vision-Language Models ✅ ✅ ✅ ✅
Audio-Language Models ✅ ❌ ❌ ❌
Text-to-Speech ✅ ❌ ✅ ❌
Server Capability ✅ ✅ ✅ ✅
User Interface ✅ ❌ ❌ ✅
Executable Installation ✅ ✅ ❌ ✅

Supported Models & Model Hub

Our on-device model hub offers all types of quantized models (text, image, audio, multimodal) with filters for RAM, file size, Tasks, etc. to help you easily explore models with UI. Explore on-device models at On-device Model Hub

Supported model examples (full list at Model Hub):

Model Type Format Command
omniaudio AudioLM GGUF nexa run omniaudio
qwen2audio AudioLM GGUF nexa run qwen2audio
octopus-v2 Function Call GGUF nexa run octopus-v2
octo-net Text GGUF nexa run octo-net
omniVLM Multimodal GGUF nexa run omniVLM
nanollava Multimodal GGUF nexa run nanollava
llava-phi3 Multimodal GGUF nexa run llava-phi3
llava-llama3 Multimodal GGUF nexa run llava-llama3
llava1.6-mistral Multimodal GGUF nexa run llava1.6-mistral
llava1.6-vicuna Multimodal GGUF nexa run llava1.6-vicuna
llama3.2 Text GGUF nexa run llama3.2
llama3-uncensored Text GGUF nexa run llama3-uncensored
gemma2 Text GGUF nexa run gemma2
qwen2.5 Text GGUF nexa run qwen2.5
mathqwen Text GGUF nexa run mathqwen
codeqwen Text GGUF nexa run codeqwen
mistral Text GGUF/ONNX nexa run mistral
deepseek-coder Text GGUF nexa run deepseek-coder
phi3.5 Text GGUF nexa run phi3.5
openelm Text GGUF nexa run openelm
stable-diffusion-v2-1 Image Generation GGUF nexa run sd2-1
stable-diffusion-3-medium Image Generation GGUF nexa run sd3
FLUX.1-schnell Image Generation GGUF nexa run flux
lcm-dreamshaper Image Generation GGUF/ONNX nexa run lcm-dreamshaper
whisper-large-v3-turbo Speech-to-Text BIN nexa run faster-whisper-large-turbo
whisper-tiny.en Speech-to-Text ONNX nexa run whisper-tiny.en
mxbai-embed-large-v1 Embedding GGUF nexa embed mxbai
nomic-embed-text-v1.5 Embedding GGUF nexa embed nomic
all-MiniLM-L12-v2 Embedding GGUF nexa embed all-MiniLM-L12-v2:fp16
bark-small Text-to-Speech GGUF nexa run bark-small:fp16

Run Models from 🤗 HuggingFace or 🤖 ModelScope

You can pull, convert (to .gguf), quantize and run llama.cpp supported text generation models from HF or MS with Nexa SDK.

Run .gguf File

Use nexa run -hf <hf-model-id> or nexa run -ms <ms-model-id> to run models with provided .gguf files:

nexa run -hf Qwen/Qwen2.5-Coder-7B-Instruct-GGUF
nexa run -ms Qwen/Qwen2.5-Coder-7B-Instruct-GGUF

Note: You will be prompted to select a single .gguf file. If your desired quantization version has multiple split files (like fp16-00001-of-00004), please use Nexa's conversion tool (see below) to convert and quantize the model locally.

Convert .safetensors Files

Install Nexa Python package, and install Nexa conversion tool with pip install "nexaai[convert]", then convert models from huggingface with nexa convert <hf-model-id>:

nexa convert HuggingFaceTB/SmolLM2-135M-Instruct

Or you can convert models from ModelScope with nexa convert -ms <ms-model-id>:

nexa convert -ms Qwen/Qwen2.5-7B-Instruct

Note: Check our leaderboard for performance benchmarks of different quantized versions of mainstream language models and HuggingFace docs to learn about quantization options.

📋 You can view downloaded and converted models with nexa list

Documentation

[!NOTE]

  1. If you want to use ONNX model, just replace pip install nexaai with pip install "nexaai[onnx]" in provided commands.
  2. If you want to run benchmark evaluation, just replace pip install nexaai with pip install "nexaai[eval]" in provided commands.
  3. If you want to convert and quantize huggingface models to GGUF models, just replace pip install nexaai with pip install "nexaai[convert]" in provided commands.
  4. For Chinese developers, we recommend you to use Tsinghua Open Source Mirror as extra index url, just replace --extra-index-url https://pypi.org/simple with --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple in provided commands.

CLI Reference

Here's a brief overview of the main CLI commands:

  • nexa run: Run inference for various tasks using GGUF models.
  • nexa onnx: Run inference for various tasks using ONNX models.
  • nexa convert: Convert and quantize huggingface models to GGUF models.
  • nexa server: Run the Nexa AI Text Generation Service.
  • nexa eval: Run the Nexa AI Evaluation Tasks.
  • nexa pull: Pull a model from official or hub.
  • nexa remove: Remove a model from local machine.
  • nexa clean: Clean up all model files.
  • nexa list: List all models in the local machine.
  • nexa login: Login to Nexa API.
  • nexa whoami: Show current user information.
  • nexa logout: Logout from Nexa API.

For detailed information on CLI commands and usage, please refer to the CLI Reference document.

Start Local Server

To start a local server using models on your local computer, you can use the nexa server command. For detailed information on server setup, API endpoints, and usage examples, please refer to the Server Reference document.

Swift Package

Swift SDK: Provides a Swifty API, allowing Swift developers to easily integrate and use llama.cpp models in their projects.

More Docs

Acknowledgements

We would like to thank the following projects:

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for nexa-sdk

Similar Open Source Tools

For similar tasks

For similar jobs