
mistral.rs
Blazingly fast LLM inference.
Stars: 6101

Mistral.rs is a fast LLM inference platform written in Rust, supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-compatible HTTP server and Python bindings.
README:
| Rust Documentation | Python Documentation | Discord | Matrix |
Mistral.rs is a cross-platform, highly-multimodal inference engine that brings you:
- All-in-one multimodal workflow: text↔text, text+vision↔text, text+vision+audio↔text, text→speech, text→image
- APIs: Rust, Python, OpenAI HTTP server (with Chat Completions, Responses API), MCP server
- 🔗 MCP Client: Connect to external tools and services automatically (file systems, web search, databases, APIs)
- Performance: ISQ, PagedAttention, FlashAttention
Please submit requests for new models here.
- Deploy with our easy-to-use APIs:
- Python
- Rust
- OpenAI-compatible HTTP server
- Interactive mode
- 🔗 MCP Client - Connect to external tools automatically
- Try the web chat app for local in-browser conversation (text, vision, and speech support):
  - Quickstart here
  - Run the server and visit http://localhost:8080 (the default).
🖥️ Web Chat App

Try our modern in-browser chat with text, vision, and speech support (TTS generation).
After following the installation instructions, run the server and open the web chat app in your browser (see the quickstart above).
- 💎🪆💎🪆💎 Run the Gemma 3n family (E2B, E4B) with vision, audio, and MatFormer support: documentation
Show commands
Normal use, run the full model (E4B or E2B):
./mistralrs-server -i --isq 8 run -m google/gemma-3n-E4B-it
Use MatFormer to get a balanced smaller model:
./mistralrs-server -i --isq 8 run -m google/gemma-3n-E4B-it \
  --matformer-config-path matformer_configs/gemma3n.csv \
  --matformer-slice-name "Config for E2.49B (block-level)"
- 🤗🤗🤗 Run the SmolLM 3 long-context hybrid-reasoning model with full tool-calling support: documentation
Show command
Default, easiest:
./mistralrs-server -i --isq 8 run -m HuggingFaceTB/SmolLM3-3B
UQFF prequantized:
./mistralrs-server -i run -m EricB/SmolLM3-3B-UQFF -f smollm33b-q4k-0.uqff
- 🔊 Run the Dia 1.6b model for highly-realistic dialogue generation: documentation
Show command
./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia
- 🦙 Run the Llama 3.* and Llama 4 models with long context & vision support: docs (llama 3.2), docs (llama 4)
Show commands
Llama 4:
./mistralrs-server -i --isq 4 run -m meta-llama/Llama-4-Scout-17B-16E-Instruct
Llama 3.1/3.2/3.3:
./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-3B-Instruct
Llama 3.2 vision:
./mistralrs-server -i --isq 8 run -m meta-llama/Llama-3.2-11B-Vision-Instruct
- 💎💎💎 Run the Gemma 3 family (1b, 4b, 12b, 27b) with 128k context & vision support: documentation
Show command
./mistralrs-server -i --isq 8 run -m google/gemma-3-4b-it
- 🌲📷 Run the FLUX.1 diffusion model: documentation
Show command
./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux
- 🧠 Run the Qwen 3 hybrid-reasoning model with full tool-calling support: documentation
Show command
./mistralrs-server -i --isq 8 run -m Qwen/Qwen3-8B
- 🔗 MCP Client - Connect to external tools and services automatically: Quick Start Guide
Show examples
1. Create config file (mcp-config.json):
{
  "servers": [{
    "name": "Filesystem Tools",
    "source": {
      "type": "Process",
      "command": "npx",
      "args": ["@modelcontextprotocol/server-filesystem", "/tmp", "-y"]
    }
  }],
  "auto_register_tools": true
}
2. Start server with tools:
./mistralrs-server --mcp-config mcp-config.json --port 1234 run -m Qwen/Qwen3-4B
3. Tools work automatically:
curl -X POST http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3-4B","messages":[{"role":"user","content":"List files in /tmp and create hello.txt"}]}'
Python API:
mcp_config = mistralrs.McpClientConfigPy(
    servers=[mistralrs.McpServerConfigPy(
        name="Filesystem",
        source=mistralrs.McpServerSourcePy.Process(
            command="npx",
            args=["@modelcontextprotocol/server-filesystem", "/tmp", "-y"]
        )
    )],
    auto_register_tools=True
)
runner = mistralrs.Runner(
    which=mistralrs.Which.Plain(model_id="Qwen/Qwen3-4B"),
    mcp_client_config=mcp_config
)
# Tools automatically available!
Rust API:
let model = TextModelBuilder::new("Qwen/Qwen3-4B")
    .with_mcp_client(mcp_config) // Tools automatically available!
    .build()
    .await?;
mistral.rs is a blazing-fast, cross-platform LLM inference engine with support for text, vision, image generation, and speech.
Key Benefits:
- Ease of Use
- OpenAI-compatible HTTP server
- Rust API & Python API
- Automatic device mapping (multi-GPU, CPU)
- Chat templates & tokenizer auto-detection
- MCP server for structured, realtime tool calls
- ⭐ MCP client to connect to external tools and services automatically
- Performance
- CPU acceleration (MKL, AVX, NEON, Accelerate)
- GPU acceleration (CUDA with FlashAttention & cuDNN, Metal)
- Automatic tensor parallelism for splitting models across multiple devices
- CUDA-specialized NCCL
- Heterogeneous, flexible Ring backend
- Quantization
- In-place quantization (ISQ) of Hugging Face models
- GGML & GGUF support: 2–8 bit
- GPTQ, AWQ, AFQ, HQQ, FP8, BNB (int8/fp4/nf4)
- ⭐ Auto-select the fastest quant method
- KV cache quantization
- Flexibility
- LoRA & X-LoRA adapters with weight merging
- AnyMoE: create MoE models on any base model
- Sampling & penalty options
- Prompt chunking for large inputs
- Integrated tool calling with customizable Python/Rust native tool and search callbacks
- Advanced Features
- High-throughput with PagedAttention & FlashAttention V2/V3
- Prefix caching (including multimodal)
- Customizable quantization with topology & UQFF format
- Speculative decoding across models
- ⭐ Agentic web search integration
Rust multithreaded/async API for easy integration into any application.
- Docs
- Examples including MCP client integration
- To use: add mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" } to your Cargo.toml
- MCP Client: Connect to external tools automatically - Quick Start
Python API for mistral.rs.
- Installation including PyPI
- Docs
- Examples including MCP client usage
- Cookbook
- MCP Client: Full MCP integration - Quick Start
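For orientation, here is a minimal, hedged sketch of the Python API in the style of the MCP snippet earlier; it assumes the mistralrs wheel is installed, and the in_situ_quant keyword (mirroring the CLI's --isq flag) is an assumption to verify against the docs and examples above:

from mistralrs import Runner, Which, ChatCompletionRequest

# Load a text model from the Hugging Face Hub.
# in_situ_quant is assumed to mirror the CLI's --isq flag; check the Python docs.
runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="8",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="Qwen/Qwen3-4B",
        messages=[{"role": "user", "content": "Write a haiku about Rust."}],
        max_tokens=128,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)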
OpenAI-compatible API server
- API Docs - includes chat completions, completions, and the Responses API for stateful conversations
- Launching the server or using the CLI
- Example
- Responses API examples - maintain conversation context without resending history (see the sketch after this list)
- Use or extend the server in other axum projects
- MCP Client: Configure via the --mcp-config flag for automatic tool integration - Quick Start
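For the Responses API bullet above, a hedged sketch using the official openai Python client against a local mistral.rs server (assumed to be listening on port 1234 and running Qwen/Qwen3-4B; chaining via previous_response_id follows the OpenAI Responses specification, which is what "without resending history" refers to):

from openai import OpenAI

# Point the standard OpenAI client at the local mistral.rs server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

first = client.responses.create(
    model="Qwen/Qwen3-4B",
    input="Summarize what PagedAttention does.",
)
print(first.output_text)

# Continue statefully: only the new input is sent, not the whole history.
follow_up = client.responses.create(
    model="Qwen/Qwen3-4B",
    previous_response_id=first.id,
    input="Now compare it to FlashAttention in one sentence.",
)
print(follow_up.output_text)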
Serve the same models over the open MCP (Model Context Protocol) in parallel to the HTTP API:
./mistralrs-server --mcp-port 4321 plain -m Qwen/Qwen3-4B
See the docs for feature flags, examples and limitations.
Accelerator | Feature Flag | Additional Flags |
---|---|---|
NVIDIA GPUs (CUDA) | cuda | flash-attn, flash-attn-v3, cudnn |
Apple Silicon GPU (Metal) | metal | |
CPU (Intel) | mkl | |
CPU (Apple Accelerate) | accelerate | |
Generic CPU (ARM/AVX) | none | ARM NEON / AVX enabled by default |
To enable one or more features, pass them to Cargo. For example:
cargo build --release --features "cuda flash-attn cudnn"
Note for Linux users: The metal feature is macOS-only and should not be used on Linux. Use --features "cuda flash-attn cudnn" for NVIDIA GPUs or --features mkl for Intel CPUs instead of --all-features.
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
- Install the Python package here.
- The Python package has wheels on PyPI!
- Install required packages:
  - OpenSSL (example on Ubuntu: sudo apt install libssl-dev)
  - Linux only: pkg-config (example on Ubuntu: sudo apt install pkg-config)
- Install Rust: https://rustup.rs/
  Example on Ubuntu:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env
- Optional: Set the HF token correctly (skip if it is already set, your model is not gated, or you want to use the token_source parameters in Python or on the command line):
  huggingface-cli login
  Note: you can install huggingface-cli as documented here.
- Download the code:
  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs
- Build or install mistralrs-server:
  - Build the mistralrs-server binary, which can be found at target/release/mistralrs-server:
    cargo build --release --features <specify feature(s) here>
  - Or install with cargo install for easy command-line usage. Pass the same values to --features as you would for cargo build:
    cargo install --path mistralrs-server --features <specify feature(s) here>
- (If you used cargo build) The build process will output a binary mistralrs-server at ./target/release/mistralrs-server. Switch to that directory so the binary can be accessed as ./mistralrs-server with the following command (example on Ubuntu):
  cd target/release
- Use our APIs and integrations:
Show: How to get models (Hub, local, GGUF, adapters, etc.)
- Default: Downloads from Hugging Face Hub.
- For gated models, you can optionally set the token source (a Python sketch follows this list):
  - CLI: ./mistralrs-server --token-source env:HF_TOKEN ...
  - Python: see examples/python/token_source.py
  - If no token is found, mistral.rs tries ~/.cache/huggingface/token or runs with no token.
- Pass a path to a downloaded model from Hugging Face hub:
  - Example: ./mistralrs-server -i run -m path/to/model
- Minimal example:
./mistralrs-server gguf -m author/model-repo -f model-quant.gguf
- Specify tokenizer (if needed):
  ./mistralrs-server gguf -m author/model-repo -f file.gguf -t author/official-tokenizer
  (Or use the built-in GGUF tokenizer.)
- Use the correct subcommand (x-lora-*, lora-*) and pass the model, adapter, or quant file as needed. See docs/ADAPTER_MODELS.md for details.
- For chat templates: usually auto-detected; override with --chat-template <file>. See docs/CHAT_TOK.md.
- See Run with the CLI below or full documentation.
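As referenced in the token source bullet above, a hedged Python sketch; the token_source keyword is assumed to mirror the CLI flag, so check examples/python/token_source.py for the exact spelling in your release:

from mistralrs import Runner, Which

# Read the Hugging Face token from the HF_TOKEN environment variable when
# downloading gated models (token_source is an assumed keyword name).
runner = Runner(
    which=Which.Plain(model_id="meta-llama/Llama-3.2-3B-Instruct"),
    token_source="env:HF_TOKEN",
)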
Mistral.rs uses subcommands to control the model type. Please run ./mistralrs-server --help to see the subcommands, which categorize the models by kind.
🚨 Important: The run subcommand (an alias for plain/vision-plain) only auto-detects and runs text and vision models. It does not support diffusion or speech models.
To run a diffusion model (e.g. the FLUX series), use the diffusion subcommand:
mistralrs-server -i diffusion -m <model-id> [options]
To run a speech model (e.g. Dia), use the speech subcommand:
mistralrs-server -i speech -m <model-id> [options]
If you attempt to use run with a diffusion or speech model, model loading will fail.
Llama 3.2 3B running on an M3 Max with 8-bit ISQ:
You can launch interactive mode, a simple chat application running in the terminal, by passing -i:
./mistralrs-server -i plain -m meta-llama/Llama-3.2-3B-Instruct
Vision models work seamlessly:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
Diffusion models can be run too (quantization and adapters are not yet supported):
./mistralrs-server -i diffusion -m black-forest-labs/FLUX.1-schnell -a flux
And you can run speech generation in your terminal!
./mistralrs-server -i speech -m nari-labs/Dia-1.6B -a dia
You can launch an HTTP server by replacing -i with --port <port>. For instance:
./mistralrs-server --port 1234 run -m microsoft/Phi-3.5-MoE-instruct
You can find documentation about the server itself here.
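Because the server is OpenAI-compatible, any OpenAI client can talk to it. A minimal sketch with the official openai Python package, assuming the Phi-3.5-MoE server launched above on port 1234 (the API key is a placeholder; adjust it if your deployment enforces one):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="microsoft/Phi-3.5-MoE-instruct",
    messages=[{"role": "user", "content": "Give me one fun fact about Rust."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)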
Serve multiple models simultaneously from a single server instance. Perfect for comparing models, A/B testing, or serving different models for different use cases.
./mistralrs-server --port 1234 multi-model --config example-multi-model-config.json --default-model-id meta-llama/Llama-3.2-3B-Instruct
Select models in your requests using the model parameter:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
📖 Complete multi-model documentation →
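Building on the client sketch above, routing a request to a particular model is just a matter of the model field; the first id below is the default from the example config, while the second is a hypothetical placeholder for whatever else your multi-model config registers:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# Compare the same prompt across the served models.
for model_id in ["meta-llama/Llama-3.2-3B-Instruct", "Qwen/Qwen3-4B"]:
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=64,
    )
    print(model_id, "->", resp.choices[0].message.content)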
We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.
Example:
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. This is specified in the --dtype/-d parameter after the model architecture (plain). For quantized models (gguf/ggml), you may specify a data type of f32 or bf16 (f16 is not recommended due to its lower precision in quantized inference).
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
Show plain architectures
mistral
gemma
mixtral
llama
phi2
phi3
phi3.5moe
qwen2
gemma2
glm4
starcoder2
deepseekv2
deepseekv3
qwen3
qwen3moe
smollm3
Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. This is specified in the --dtype/-d parameter after the model architecture (vision-plain).
Show vision architectures
phi3v
idefics2
llava_next
llava
vllama
qwen2vl
idefics3
minicpmo
phi4mm
qwen2_5vl
gemma3
mistral3
llama4
gemma3n
Show supported GGUF architectures
Plain:
- llama
- phi2
- phi3
- starcoder2
- qwen2
- qwen3
With adapters:
- llama
- phi3
Please submit more benchmarks by raising an issue!
Show quantization support
Quantization support
Model | GGUF | GGML | ISQ |
---|---|---|---|
Mistral | ✅ | ✅ | |
Gemma | ✅ | ||
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | |
Phi 2 | ✅ | ✅ | |
Phi 3 | ✅ | ✅ | |
Phi 3.5 MoE | ✅ | ||
Qwen 2.5 | ✅ | ||
Phi 3 Vision | ✅ | ||
Idefics 2 | ✅ | ||
Gemma 2 | ✅ | ||
GLM4 | ✅ | ||
Starcoder 2 | ✅ | ✅ | |
LLaVa Next | ✅ | ||
LLaVa | ✅ | ||
Llama 3.2 Vision | ✅ | ||
Qwen2-VL | ✅ | ||
Idefics 3 | ✅ | ||
Deepseek V2 | ✅ | ||
Deepseek V3 | ✅ | ||
MiniCPM-O 2.6 | ✅ | ||
Qwen2.5-VL | ✅ | ||
Gemma 3 | ✅ | ||
Mistral 3 | ✅ | ||
Llama 4 | ✅ | ||
Qwen 3 | ✅ | ✅ | |
SmolLM3 | ✅ | ||
Dia 1.6b | ✅ | ||
Gemma 3n | ✅ |
Show device mapping support
Device mapping support
Model category | Supported |
---|---|
Plain | ✅ |
GGUF | ✅ |
GGML | |
Vision Plain | ✅ |
Show X-LoRA and LoRA support
X-LoRA and LoRA support
Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
---|---|---|---|
Mistral | ✅ | ✅ | |
Gemma | ✅ | ||
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | |
Phi 2 | ✅ | ||
Phi 3 | ✅ | ✅ | |
Phi 3.5 MoE | |||
Qwen 2.5 | |||
Phi 3 Vision | |||
Idefics 2 | |||
Gemma 2 | ✅ | ||
GLM4 | ✅ | ||
Starcoder 2 | ✅ | ||
LLaVa Next | |||
LLaVa | |||
Qwen2-VL | |||
Idefics 3 | |||
Deepseek V2 | |||
Deepseek V3 | |||
MiniCPM-O 2.6 | |||
Qwen2.5-VL | |||
Gemma 3 | |||
Mistral 3 | |||
Llama 4 | |||
Qwen 3 | |||
SmolLM3 | ✅ | ||
Gemma 3n |
Show AnyMoE support
AnyMoE support
Model | AnyMoE |
---|---|
Mistral 7B | ✅ |
Gemma | ✅ |
Llama | ✅ |
Mixtral | |
Phi 2 | ✅ |
Phi 3 | ✅ |
Phi 3.5 MoE | |
Qwen 2.5 | ✅ |
Phi 3 Vision | |
Idefics 2 | |
Gemma 2 | ✅ |
Starcoder 2 | ✅ |
LLaVa Next | ✅ |
LLaVa | ✅ |
Llama 3.2 Vision | |
Qwen2-VL | |
Idefics 3 | ✅ |
Deepseek V2 | |
Deepseek V3 | |
MiniCPM-O 2.6 | |
Qwen2.5-VL | |
Gemma 3 | ✅ |
Mistral 3 | ✅ |
Llama 4 | |
Qwen 3 | |
SmolLM3 | ✅ |
Gemma 3n |
To use a derivative or adapter model (e.g., quantized, LoRA, X-LoRA, vision), select the correct architecture subcommand and pass the required arguments: typically the model id, plus the quantization filename, tokenizer, or adapter ordering for quantized and adapter models.
- See all options: run ./mistralrs-server <subcommand> --help
- Docs: Adapter models, Chat templates
Arguments by model type
Model Type | Required Arguments |
---|---|
Plain | model id |
Quantized | model id, quantized filename, tokenizer id |
X-LoRA | model id, X-LoRA ordering (if not default) |
X-LoRA quantized | model id, quantized filename, tokenizer id, X-LoRA ordering |
LoRA | model id, LoRA ordering (if not default) |
LoRA quantized | model id, quantized filename, tokenizer id, LoRA ordering |
Vision Plain | model id |
Example: Zephyr GGUF model
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
Chat template and tokenizer are usually auto-detected.
If you need to override, see the chat templates doc.
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here. Examples may be found here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
- Debugging with the environment variable MISTRALRS_DEBUG=1 causes the following:
  - If loading a GGUF or GGML model, this will output a file (mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt) containing the names, shapes, and types of each tensor.
  - More logging.
- Setting the CUDA compiler path:
  - Set the NVCC_CCBIN environment variable during build.
- Error: recompile with -fPIE:
  - Some Linux distributions require compiling with -fPIE.
  - Set the CUDA_NVCC_FLAGS environment variable to -fPIE during build: CUDA_NVCC_FLAGS=-fPIE
- Error CUDA_ERROR_NOT_FOUND or symbol not found when using a normal or vision model:
  - For non-quantized models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device.
- What is the minimum supported CUDA compute cap?
- The minimum CUDA compute cap is 5.3.
- Metal not found (error: unable to find utility "metal", not a developer tool or in PATH)
  - Install Xcode: xcode-select --install
  - Set the active developer directory: sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
- Disabling Metal kernel precompilation:
  - By default, Metal kernels are precompiled during build time for better performance.
  - To skip Metal kernel precompilation (useful for CI or when Metal is not needed), set MISTRALRS_METAL_PRECOMPILE=0 or MISTRALRS_METAL_PRECOMPILE=false.
  - Example: MISTRALRS_METAL_PRECOMPILE=0 cargo build --release --features metal
This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.
Alternative AI tools for mistral.rs
Similar Open Source Tools

mistral.rs
Mistral.rs is a fast LLM inference platform written in Rust, supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-compatible HTTP server and Python bindings.

nvim-aider
Nvim-aider is a plugin for Neovim that provides additional functionality and key mappings to enhance the user's editing experience. It offers features such as code navigation, quick access to commonly used commands, and improved text manipulation tools. With Nvim-aider, users can streamline their workflow and increase productivity while working with Neovim.

jadx-mcp-server
JADX-MCP-SERVER is a standalone Python server that interacts with JADX-AI-MCP Plugin to analyze Android APKs using LLMs like Claude. It enables live communication with decompiled Android app context, uncovering vulnerabilities, parsing manifests, and facilitating reverse engineering effortlessly. The tool combines JADX-AI-MCP and JADX MCP SERVER to provide real-time reverse engineering support with LLMs, offering features like quick analysis, vulnerability detection, AI code modification, static analysis, and reverse engineering helpers. It supports various MCP tools for fetching class information, text, methods, fields, smali code, AndroidManifest.xml content, strings.xml file, resource files, and more. Tested on Claude Desktop, it aims to support other LLMs in the future, enhancing Android reverse engineering and APK modification tools connectivity for easier reverse engineering purely from vibes.

LocalLLMClient
LocalLLMClient is a Swift package designed to interact with local Large Language Models (LLMs) on Apple platforms. It supports GGUF, MLX models, and the FoundationModels framework, providing streaming API, multimodal capabilities, and tool calling functionalities. Users can easily integrate this tool to work with various models for text generation and processing. The package also includes advanced features for low-level API control and multimodal image processing. LocalLLMClient is experimental and subject to API changes, offering support for iOS, macOS, and Linux platforms.

holmesgpt
HolmesGPT is an open-source DevOps assistant powered by OpenAI or any tool-calling LLM of your choice. It helps in troubleshooting Kubernetes, incident response, ticket management, automated investigation, and runbook automation in plain English. The tool connects to existing observability data, is compliance-friendly, provides transparent results, supports extensible data sources, runbook automation, and integrates with existing workflows. Users can install HolmesGPT using Brew, prebuilt Docker container, Python Poetry, or Docker. The tool requires an API key for functioning and supports OpenAI, Azure AI, and self-hosted LLMs.

ai21-python
The AI21 Labs Python SDK is a comprehensive tool for interacting with the AI21 API. It provides functionalities for chat completions, conversational RAG, token counting, error handling, and support for various cloud providers like AWS, Azure, and Vertex. The SDK offers both synchronous and asynchronous usage, along with detailed examples and documentation. Users can quickly get started with the SDK to leverage AI21's powerful models for various natural language processing tasks.

jadx-ai-mcp
JADX-AI-MCP is a plugin for the JADX decompiler that integrates with Model Context Protocol (MCP) to provide live reverse engineering support with LLMs like Claude. It allows for quick analysis, vulnerability detection, and AI code modification, all in real time. The tool combines JADX-AI-MCP and JADX MCP SERVER to analyze Android APKs effortlessly. It offers various prompts for code understanding, vulnerability detection, reverse engineering helpers, static analysis, AI code modification, and documentation. The tool is part of the Zin MCP Suite and aims to connect all android reverse engineering and APK modification tools with a single MCP server for easy reverse engineering of APK files.

enterprise-h2ogpte
Enterprise h2oGPTe - GenAI RAG is a repository containing code examples, notebooks, and benchmarks for the enterprise version of h2oGPTe, a powerful AI tool for generating text based on the RAG (Retrieval-Augmented Generation) architecture. The repository provides resources for leveraging h2oGPTe in enterprise settings, including implementation guides, performance evaluations, and best practices. Users can explore various applications of h2oGPTe in natural language processing tasks, such as text generation, content creation, and conversational AI.

gorilla
Gorilla is a tool that enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically- and syntactically- correct API to invoke. With Gorilla, you can use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. Gorilla also releases APIBench, the largest collection of APIs, curated and easy to be trained on!

baibot
Baibot is a versatile chatbot framework designed to simplify the process of creating and deploying chatbots. It provides a user-friendly interface for building custom chatbots with various functionalities such as natural language processing, conversation flow management, and integration with external APIs. Baibot is highly customizable and can be easily extended to suit different use cases and industries. With Baibot, developers can quickly create intelligent chatbots that can interact with users in a seamless and engaging manner, enhancing user experience and automating customer support processes.

MaxKB
MaxKB is a knowledge base Q&A system based on the LLM large language model. MaxKB = Max Knowledge Base, which aims to become the most powerful brain of the enterprise.

aide
Aide is a code-first API documentation and utility library for Rust, along with other related utility crates for web-servers. It provides tools for creating API documentation and handling JSON request validation. The repository contains multiple crates that offer drop-in replacements for existing libraries, ensuring compatibility with Aide. Contributions are welcome, and the code is dual licensed under MIT and Apache-2.0. If Aide does not meet your requirements, you can explore similar libraries like paperclip, utoipa, and okapi.

databerry
Chaindesk is a no-code platform that allows users to easily set up a semantic search system for personal data without technical knowledge. It supports loading data from various sources such as raw text, web pages, files (Word, Excel, PowerPoint, PDF, Markdown, Plain Text), and upcoming support for web sites, Notion, and Airtable. The platform offers a user-friendly interface for managing datastores, querying data via a secure API endpoint, and auto-generating ChatGPT Plugins for each datastore. Chaindesk utilizes a Vector Database (Qdrant), Openai's text-embedding-ada-002 for embeddings, and has a chunk size of 1024 tokens. The technology stack includes Next.js, Joy UI, LangchainJS, PostgreSQL, Prisma, and Qdrant, inspired by the ChatGPT Retrieval Plugin.

chatluna
Chatluna is a machine learning model plugin that provides chat services with large language models. It is highly extensible, supports multiple output formats, and offers features like custom conversation presets, rate limiting, and context awareness. Users can deploy Chatluna under Koishi without additional configuration. The plugin supports various models/platforms like OpenAI, Azure OpenAI, Google Gemini, and more. It also provides preset customization using YAML files and allows for easy forking and development within Koishi projects. However, the project lacks web UI, HTTP server, and project documentation, inviting contributions from the community.

MateCat
Matecat is an enterprise-level, web-based CAT tool designed to make post-editing and outsourcing easy and to provide a complete set of features to manage and monitor translation projects.

bisheng
Bisheng is a leading open-source **large model application development platform** that empowers and accelerates the development and deployment of large model applications, helping users enter the next generation of application development with the best possible experience.
For similar tasks

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.

jupyter-ai
Jupyter AI connects generative AI with Jupyter notebooks. It provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook. Specifically, Jupyter AI offers: * An `%%ai` magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, Kaggle, VSCode, etc.). * A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant. * Support for a wide range of generative model providers, including AI21, Anthropic, AWS, Cohere, Gemini, Hugging Face, NVIDIA, and OpenAI. * Local model support through GPT4All, enabling use of generative AI models on consumer grade machines with ease and privacy.

khoj
Khoj is an open-source, personal AI assistant that extends your capabilities by creating always-available AI agents. You can share your notes and documents to extend your digital brain, and your AI agents have access to the internet, allowing you to incorporate real-time information. Khoj is accessible on Desktop, Emacs, Obsidian, Web, and Whatsapp, and you can share PDF, markdown, org-mode, notion files, and GitHub repositories. You'll get fast, accurate semantic search on top of your docs, and your agents can create deeply personal images and understand your speech. Khoj is self-hostable and always will be.

langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).

danswer
Danswer is an open-source Gen-AI Chat and Unified Search tool that connects to your company's docs, apps, and people. It provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. Since you own the deployment, your user data and chats are fully in your own control. Danswer is MIT licensed and designed to be modular and easily extensible. The system also comes fully ready for production usage with user authentication, role management (admin/basic users), chat persistence, and a UI for configuring Personas (AI Assistants) and their Prompts. Danswer also serves as a Unified Search across all common workplace tools such as Slack, Google Drive, Confluence, etc. By combining LLMs and team specific knowledge, Danswer becomes a subject matter expert for the team. Imagine ChatGPT if it had access to your team's unique knowledge! It enables questions such as "A customer wants feature X, is this already supported?" or "Where's the pull request for feature Y?"

infinity
Infinity is an AI-native database designed for LLM applications, providing incredibly fast full-text and vector search capabilities. It supports a wide range of data types, including vectors, full-text, and structured data, and offers a fused search feature that combines multiple embeddings and full text. Infinity is easy to use, with an intuitive Python API and a single-binary architecture that simplifies deployment. It achieves high performance, with 0.1 milliseconds query latency on million-scale vector datasets and up to 15K QPS.
For similar jobs

h2ogpt
h2oGPT is an Apache V2 open-source project that allows users to query and summarize documents or chat with local private GPT LLMs. It features a private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.), a persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.), and efficient use of context using instruct-tuned LLMs (no need for LangChain's few-shot approach). h2oGPT also offers parallel summarization and extraction, reaching an output of 80 tokens per second with the 13B LLaMa2 model, HYDE (Hypothetical Document Embeddings) for enhanced retrieval based upon LLM responses, a variety of models supported (LLaMa2, Mistral, Falcon, Vicuna, WizardLM. With AutoGPTQ, 4-bit/8-bit, LORA, etc.), GPU support from HF and LLaMa.cpp GGML models, and CPU support using HF, LLaMa.cpp, and GPT4ALL models. Additionally, h2oGPT provides Attention Sinks for arbitrarily long generation (LLaMa-2, Mistral, MPT, Pythia, Falcon, etc.), a UI or CLI with streaming of all models, the ability to upload and view documents through the UI (control multiple collaborative or personal collections), Vision Models LLaVa, Claude-3, Gemini-Pro-Vision, GPT-4-Vision, Image Generation Stable Diffusion (sdxl-turbo, sdxl) and PlaygroundAI (playv2), Voice STT using Whisper with streaming audio conversion, Voice TTS using MIT-Licensed Microsoft Speech T5 with multiple voices and Streaming audio conversion, Voice TTS using MPL2-Licensed TTS including Voice Cloning and Streaming audio conversion, AI Assistant Voice Control Mode for hands-free control of h2oGPT chat, Bake-off UI mode against many models at the same time, Easy Download of model artifacts and control over models like LLaMa.cpp through the UI, Authentication in the UI by user/password via Native or Google OAuth, State Preservation in the UI by user/password, Linux, Docker, macOS, and Windows support, Easy Windows Installer for Windows 10 64-bit (CPU/CUDA), Easy macOS Installer for macOS (CPU/M1/M2), Inference Servers support (oLLaMa, HF TGI server, vLLM, Gradio, ExLLaMa, Replicate, OpenAI, Azure OpenAI, Anthropic), OpenAI-compliant, Server Proxy API (h2oGPT acts as drop-in-replacement to OpenAI server), Python client API (to talk to Gradio server), JSON Mode with any model via code block extraction. Also supports MistralAI JSON mode, Claude-3 via function calling with strict Schema, OpenAI via JSON mode, and vLLM via guided_json with strict Schema, Web-Search integration with Chat and Document Q/A, Agents for Search, Document Q/A, Python Code, CSV frames (Experimental, best with OpenAI currently), Evaluate performance using reward models, and Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours.

mistral.rs
Mistral.rs is a fast LLM inference platform written in Rust, supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-compatible HTTP server and Python bindings.

ollama
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. Ollama is designed to be easy to use and accessible to developers of all levels. It is open source and available for free on GitHub.

llama-cpp-agent
The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM models, execute structured function calls and get structured output (objects). It provides a simple yet robust interface and supports llama-cpp-python and OpenAI endpoints with GBNF grammar support (like the llama-cpp-python server) and the llama.cpp backend server. It works by generating a formal GGML-BNF grammar of the user defined structures and functions, which is then used by llama.cpp to generate text valid to that grammar. In contrast to most GBNF grammar generators it also supports nested objects, dictionaries, enums and lists of them.

llama_ros
This repository provides a set of ROS 2 packages to integrate llama.cpp into ROS 2. By using the llama_ros packages, you can easily incorporate the powerful optimization capabilities of llama.cpp into your ROS 2 projects by running GGUF-based LLMs and VLMs.

MITSUHA
OneReality is a virtual waifu/assistant that you can speak to through your mic and it'll speak back to you! It has many features such as: * You can speak to her with a mic * It can speak back to you * Has short-term memory and long-term memory * Can open apps * Smarter than you * Fluent in English, Japanese, Korean, and Chinese * Can control your smart home like Alexa if you set up Tuya (more info in Prerequisites) It is built with Python, Llama-cpp-python, Whisper, SpeechRecognition, PocketSphinx, VITS-fast-fine-tuning, VITS-simple-api, HyperDB, Sentence Transformers, and Tuya Cloud IoT.

wenxin-starter
WenXin-Starter is a spring-boot-starter for Baidu's "Wenxin Qianfan WENXINWORKSHOP" large model, which can help you quickly access Baidu's AI capabilities. It fully integrates the official API documentation of Wenxin Qianfan. Supports text-to-image generation, built-in dialogue memory, and supports streaming return of dialogue. Supports QPS control of a single model and supports queuing mechanism. Plugins will be added soon.

FlexFlow
FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.