nexa-sdk

Run the latest LLMs and VLMs across GPU, NPU, and CPU, with bindings for Python, Android (Java), and iOS (Swift). Get up and running quickly with OpenAI gpt-oss, Gemma 3, Qwen3, and more.

Nexa SDK is a comprehensive toolkit supporting ONNX and GGML models for text generation, image generation, vision-language models (VLM), and text-to-speech (TTS). It offers an OpenAI-compatible API server with JSON-schema mode and streaming support, along with a user-friendly Streamlit UI. Nexa SDK runs on any device with a Python environment, with optional GPU acceleration. The toolkit bundles model support, a conversion engine, and inference engines for these tasks.

README:

Documentation · X account · Discord · Slack

Nexa SDK

Nexa SDK is an on-device inference framework that runs any model on any device, across any backend. It runs on CPUs, GPUs, and NPUs, with backend support for CUDA, Metal, Vulkan, and Qualcomm NPU. It handles multiple input modalities including text 📝, image 🖼️, and audio 🎧. The SDK includes an OpenAI-compatible API server with support for JSON schema-based function calling and streaming. It supports the GGUF and MLX model formats as well as Nexa AI's own .nexa format, enabling efficient quantized inference across diverse platforms.
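
As a concrete sketch of the OpenAI-compatible server: the request below assumes you have started a server with nexa serve --host 127.0.0.1:8080 (see the CLI Reference) and already pulled the model; the /v1/chat/completions route and request shape follow the standard OpenAI convention, and the model name is just a placeholder.

# Sketch: assumes `nexa serve --host 127.0.0.1:8080` is running and
# ggml-org/Qwen3-1.7B-GGUF has been pulled; any cached model name works.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/Qwen3-1.7B-GGUF",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": false
  }'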

Qualcomm NPU PC Demos

๐Ÿ–ผ๏ธ Multi-Image Reasoning
Spot the difference across two images in multi-round dialogue.

🎤 Image + Audio → Function Call
Snap a poster, add a voice note, and the AI agent creates a calendar event.

🎶 Multi-Audio Comparison
Tell the difference between two music clips locally.

Recent updates

📣 2025.09.23: Intel NPU Support

📣 2025.09.22: Apple Neural Engine (ANE) Support

📣 2025.09.15: New Models Support

📣 2025.09.05: Turbo Engine & Unified Interface

  • Nexa ML Turbo engine for optimized NPU performance
  • Unified interface supporting NPU/GPU/CPU backends:
    • Single-installer architecture that eliminates dependency conflicts
    • Lazy loading and plugin isolation for improved performance

📣 2025.08.20: Qualcomm NPU Support with NexaML Turbo Engine

📣 2025.08.12: ASR & TTS Support in MLX format

  • Support for the Parakeet and Kokoro models in MLX format.
  • New /mic mode to transcribe live speech directly in your terminal.

Installation

macOS and Windows: download and run the installer from the latest GitHub release.

Linux:

curl -fsSL https://github.com/NexaAI/nexa-sdk/releases/latest/download/nexa-cli_linux_x86_64.sh -o install.sh && chmod +x install.sh && ./install.sh && rm install.sh
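
Once installed, you can sanity-check the CLI with commands from the CLI Reference below:

nexa -h      # show all CLI commands
nexa list    # show all cached models with sizes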

Supported Models

You can run any compatible GGUF, MLX, or .nexa model from 🤗 Hugging Face by using its full repo name (<org>/<repo>).
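
For example, to download a model by its full repo name and then chat with it locally (the repo shown is one of the GGUF examples below):

nexa pull ggml-org/Qwen3-1.7B-GGUF
nexa infer ggml-org/Qwen3-1.7B-GGUF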

Qualcomm NPU models

[!TIP] You need the arm64 build with Qualcomm NPU support, and your laptop must have a Snapdragon® X Elite chip.

Quick Start (Windows arm64, Snapdragon X Elite)

  1. Login & Get Access Token (required for Pro Models)

    • Create an account at sdk.nexa.ai
    • Go to Deployment → Create Token
    • Run this once in your terminal (replace with your token):
      nexa config set license '<your_token_here>'
  2. Run and chat with our multimodal model, OmniNeural-4B, or other models on NPU

nexa infer omni-neural
nexa infer NexaAI/OmniNeural-4B
nexa infer NexaAI/qwen3-1.7B-npu

GGUF models

[!TIP] GGUF runs on macOS, Linux, and Windows.

๐Ÿ“ Run and chat with LLMs, e.g. Qwen3:

nexa infer ggml-org/Qwen3-1.7B-GGUF

๐Ÿ–ผ๏ธ Run and chat with Multimodal models, e.g. Qwen2.5-Omni:

nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF

MLX models

[!TIP] MLX is macOS-only (Apple Silicon). Many MLX models in the Hugging Face mlx-community organization have quality issues and may not run reliably. We recommend starting with models from our curated NexaAI Collection for best results.

๐Ÿ“ Run and chat with LLMs, e.g. Qwen3:

nexa infer NexaAI/Qwen3-4B-4bit-MLX

๐Ÿ–ผ๏ธ Run and chat with Multimodal models, e.g. Gemma3n:

nexa infer NexaAI/gemma-3n-E4B-it-4bit-MLX

CLI Reference

Essential Command                    What it does
nexa -h                              Show all CLI commands
nexa pull <repo>                     Interactive download & cache of a model
nexa infer <repo>                    Local inference
nexa list                            Show all cached models with sizes
nexa remove <repo> / nexa clean      Delete one / all cached models
nexa serve --host 127.0.0.1:8080     Launch OpenAI-compatible REST server
nexa run <repo>                      Chat with a model via an existing server

👉 To interact with multimodal models, you can drag photos or audio clips directly into the CLI; you can even drop multiple images at once!

See CLI Reference for full commands.
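
Because the server supports JSON schema-based function calling, a request in the standard OpenAI tools format should look roughly like the sketch below; get_weather is a hypothetical tool defined here only for illustration, and the model name is a placeholder.

# Sketch: `get_weather` is a hypothetical tool; the tools payload follows the
# standard OpenAI chat-completions function-calling schema.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-4B-4bit-MLX",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'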

Acknowledgements

We would like to thank the open-source projects that Nexa SDK builds on.
