oMLX

LLM inference, optimized for your Mac
Continuous batching and infinite SSD caching, managed directly from your menu bar.

License · Python 3.10+ · Apple Silicon · Buy Me a Coffee

Install · Quickstart · GitHub


oMLX Admin Dashboard

Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.

oMLX persists the KV cache to SSD - even when the context changes mid-conversation, all past context stays cached and reusable across requests, which makes local LLMs practical for real coding work with tools like Claude Code. That's why I built it.

Install

macOS App

Download the .dmg from Releases, drag to Applications, done.

From Source

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .

Requires Python 3.10+ and Apple Silicon (M1/M2/M3/M4).

Quickstart

macOS App

Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it.

oMLX Welcome Screen

CLI

omlx serve --model-dir ~/models

The server discovers models from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1. A built-in chat UI is also available at http://localhost:8000/admin/chat.
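
For example, any OpenAI-compatible SDK can talk to the server once a model is loaded. A minimal Python sketch (the model name is a placeholder - use one of the directory names from your model folder):

from openai import OpenAI

# Point the standard OpenAI client at the local oMLX server.
# No real API key is needed unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-Coder-Next-8bit",  # placeholder: any model directory served by oMLX
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
)
print(response.choices[0].message.content)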

Features

oMLX is built on top of vllm-mlx, extending it with paged SSD caching, multi-model serving, an admin dashboard, Claude Code optimization, and Anthropic API support. Currently supports text-based LLMs - VLM and OCR model support is planned for upcoming milestones.

macOS menu bar app - Start, stop, and monitor the server from a native menu bar app, without opening a terminal.

Admin dashboard - Web UI at /admin for model management, chat, real-time monitoring, and per-model settings.

Built-in model downloader - Search and download MLX models from HuggingFace directly in the admin dashboard. No CLI or git clone needed.

oMLX Model Downloader

Claude Code optimization - Context scaling for running smaller-context models with Claude Code: reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alives prevent read timeouts during long prefill.

Paged KV cache with SSD tiering - Block-based cache management inspired by vLLM, with prefix sharing and Copy-on-Write. When GPU memory fills up, blocks are offloaded to SSD. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.

Continuous batching - Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.

Multi-model serving - Load LLMs, embedding models, and rerankers within the same server. Least-recently-used models are evicted automatically when memory runs low. Pin frequently used models to keep them loaded.

API compatibility - Drop-in replacement for OpenAI and Anthropic APIs.

Endpoint                        Description
POST /v1/chat/completions       Chat completions (streaming)
POST /v1/completions            Text completions (streaming)
POST /v1/messages               Anthropic Messages API
POST /v1/embeddings             Text embeddings
POST /v1/rerank                 Document reranking
GET  /v1/models                 List available models
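
As an illustration, the Anthropic-style /v1/messages endpoint can be exercised with a plain HTTP request. A hedged sketch - the model name is a placeholder, and the extra headers required by Anthropic's hosted API may not be needed locally:

import requests

# Minimal Anthropic Messages API request against the local server.
resp = requests.post(
    "http://localhost:8000/v1/messages",
    json={
        "model": "Qwen3-Coder-Next-8bit",  # placeholder model name
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize continuous batching in one sentence."}],
    },
)
print(resp.json())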

Tool calling & structured output - Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:

Model Family                    Format
Llama, Qwen, DeepSeek, etc.     JSON <tool_call>
Qwen3 Coder                     XML <function=...>
Gemma                           <start_function_call>
GLM (4.7, 5)                    <arg_key>/<arg_value> XML
MiniMax                         Namespaced <minimax:tool_call>
Mistral                         [TOOL_CALLS]
Kimi K2                         <|tool_calls_section_begin|>
Longcat                         <longcat_tool_call>

Models not listed above may still work if their chat template accepts tools and their output uses a recognized <tool_call> XML format. Streaming requests with tool calls buffer all content and emit results at completion.
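
A hedged example of a function-calling request through the OpenAI-compatible endpoint (the tool schema and model name are illustrative; the model's chat template must accept the tools parameter):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool definition; any JSON-schema function works the same way.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-Coder-Next-8bit",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)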

Models

Point --model-dir at a directory containing MLX-format model subdirectories:

~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
└── bge-m3/

Models are auto-detected by type. You can also download models directly from the admin dashboard.

Type          Models
LLM           Any model supported by mlx-lm
Embedding     BERT, BGE-M3, ModernBERT
Reranker      ModernBERT, XLM-RoBERTa
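
Embedding models are served through the standard embeddings endpoint. A minimal sketch with the OpenAI client, using the bge-m3 directory from the example layout above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Request embeddings from a locally served embedding model.
emb = client.embeddings.create(
    model="bge-m3",  # model directory name from the example layout above
    input=["continuous batching", "paged KV cache"],
)
print(len(emb.data), len(emb.data[0].embedding))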

CLI Configuration

# Memory limit for loaded models
omlx serve --model-dir ~/models --max-model-memory 32GB

# Enable SSD cache for KV blocks
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Adjust batch sizes
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32

# With MCP tools
omlx serve --model-dir ~/models --mcp-config mcp.json

# API key authentication
omlx serve --model-dir ~/models --api-key your-secret-key

All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.
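
When --api-key is set, clients are expected to present the key as a standard bearer token; with the OpenAI SDK that is simply the api_key argument (a sketch, assuming the usual Authorization: Bearer scheme):

from openai import OpenAI

# Pass the same key given to `omlx serve --api-key your-secret-key`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")
print([m.id for m in client.models.list()])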

Architecture
FastAPI Server (OpenAI / Anthropic API)
    │
    ├── EnginePool (multi-model, LRU eviction)
    │   ├── BatchedEngine (LLMs, continuous batching)
    │   ├── EmbeddingEngine
    │   └── RerankerEngine
    │
    ├── Scheduler (FCFS, configurable batch sizes)
    │   └── mlx-lm BatchGenerator
    │
    └── Cache Stack
        ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
        └── PagedSSDCacheManager (SSD tier, safetensors format)
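
The block-and-tier idea behind the cache stack can be illustrated with a toy sketch - not oMLX's actual code, just the general pattern of fixed-size blocks keyed by a prefix hash, spilled to an SSD directory when the in-memory pool is full, and restored on a later prefix match instead of being recomputed:

import hashlib, os, pickle, tempfile
from collections import OrderedDict

MAX_RESIDENT_BLOCKS = 4    # tiny in-memory pool, purely for illustration

class ToyPagedCache:
    """Toy prefix cache: an in-memory LRU of KV blocks with an SSD spill tier."""

    def __init__(self, ssd_dir: str):
        self.resident: OrderedDict[str, list] = OrderedDict()
        self.ssd_dir = ssd_dir

    @staticmethod
    def _key(prefix_tokens: tuple) -> str:
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()[:16]

    def put(self, prefix_tokens: tuple, kv_block: list) -> None:
        key = self._key(prefix_tokens)
        self.resident[key] = kv_block
        self.resident.move_to_end(key)
        # Spill least-recently-used blocks to the SSD tier when the pool is full.
        while len(self.resident) > MAX_RESIDENT_BLOCKS:
            old_key, old_block = self.resident.popitem(last=False)
            with open(os.path.join(self.ssd_dir, old_key), "wb") as f:
                pickle.dump(old_block, f)

    def get(self, prefix_tokens: tuple):
        key = self._key(prefix_tokens)
        if key in self.resident:                       # in-memory hit
            self.resident.move_to_end(key)
            return self.resident[key]
        path = os.path.join(self.ssd_dir, key)
        if os.path.exists(path):                       # SSD hit: restore instead of recompute
            with open(path, "rb") as f:
                block = pickle.load(f)
            self.put(prefix_tokens, block)             # promote back to memory
            return block
        return None                                    # miss: caller recomputes

cache = ToyPagedCache(tempfile.mkdtemp())
cache.put((1, 2, 3), ["kv-tensors-for-this-block"])
print(cache.get((1, 2, 3)) is not None)               # True: served from cache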

Development

CLI Server

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"

macOS App

Requires Python 3.11+ and venvstacks (pip install venvstacks).

cd packaging

# Full build (venvstacks + app bundle + DMG)
python build.py

# Skip venvstacks (code changes only)
python build.py --skip-venv

# DMG only
python build.py --dmg-only

See packaging/README.md for details on the app bundle structure and layer configuration.

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements

License

Apache 2.0

Acknowledgments

  • MLX and mlx-lm by Apple
  • vllm-mlx - oMLX originated as a fork of vllm-mlx v0.1.0, since re-architected with multi-model serving, paged SSD caching, an admin panel, and a standalone macOS menu bar app
  • venvstacks - Portable Python environment layering for the macOS app bundle
  • mlx-embeddings - Embedding model support for Apple Silicon
