
mistral.rs
Blazingly fast LLM inference.
Stars: 5396

Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-API-compatible HTTP server and Python bindings.
README:
| Rust Documentation | Python Documentation | Discord | Matrix |
Please submit requests for new models here.
- Deploy with our easy-to-use APIs after following the installation instructions.
- Check out UQFF for prequantized models of various methods! Models can be found here.
- 🔍🌐 Easily add web search capabilities to your models! Compatible with OpenAI's web_search_options parameter: documentation (see the Python sketch after this list)
  ./mistralrs-server --enable-search --port 1234 --isq q4k plain -m NousResearch/Hermes-3-Llama-3.1-8B
- 💎💎💎 Run the entire Gemma 3 model family (1b, 4b, 12b, 27b) with 128k context length and vision support: documentation
  ./mistralrs-server -i vision-plain -m google/gemma-3-4b-it -a gemma3
- Run the Mistral 3 model with 128k context length and strong vision support: documentation
  ./mistralrs-server -i --isq q4k vision-plain -m mistralai/Mistral-Small-3.1-24B-Instruct-2503 -a mistral3
- 🐋🐋🐋 Run the DeepSeek R1/V3 models with automatic tensor parallelism: documentation
  ./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1
- 🐋🐋🐋 Run the DeepSeek R1 distillations out of the box:
  ./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
  ./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
  ./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
- 🦙📷 Run the Llama 3.2 Vision model: documentation and guide here
  ./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
- φ⁴ 📷 Run the Phi 4 Multimodal model: documentation and guide here
  ./mistralrs-server -i vision-plain -m microsoft/Phi-4-multimodal-instruct -a phi4mm
- φ⁴ Run the new Phi 4/Phi 4 Mini models with a 128k context window:
  ./mistralrs-server -i plain -m microsoft/Phi-4-mini-instruct -a phi3
- 🧮 Enhance ISQ by collecting an imatrix from calibration data: documentation
  ./mistralrs-server -i --isq Q4K plain -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt
- 🌲📷 Run the FLUX.1 diffusion model: documentation and guide here
  ./mistralrs-server --port 1234 diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
- Other models: see the support matrix and how to run them
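For instance, the web-search item above can be exercised through the OpenAI-compatible server using the official `openai` Python client. This is a hedged sketch: the `/v1` route, the placeholder API key, the illustrative model name, and the `web_search_options` payload (which follows OpenAI's documented shape) are assumptions to verify against the linked documentation.

```python
# Hedged sketch: assumes mistralrs-server is running locally with
# `--enable-search --port 1234` and exposes OpenAI-compatible /v1 routes.
from openai import OpenAI

# The API key is a placeholder; the model name is illustrative.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What are the latest mistral.rs releases?"}],
    # Passed through as OpenAI's web_search_options; field names follow OpenAI's
    # documented shape and are an assumption for this server.
    extra_body={"web_search_options": {"search_context_size": "medium"}},
)
print(response.choices[0].message.content)
```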
Mistral.rs supports several model categories:
- Text to Text
- Text+Image to Text: Vision (see the docs)
- Text to Image: Image Generation (see the docs)
Easy:
- Lightweight OpenAI API compatible HTTP server
- Python API
- Grammar support with JSON Schema, Regex, Lark, and Guidance via LLGuidance library
- ISQ (in-situ quantization): run .safetensors models directly from 🤗 Hugging Face by quantizing them in place
  - Enhance performance with an imatrix!
- Automatic device mapping to easily load and run models across multiple GPUs and CPU.
- Specify custom chat templates easily: chat templates
Fast:
- Apple silicon support: ARM NEON, Accelerate, Metal
- Accelerated CPU inference with MKL, AVX support
- CUDA support with FlashAttention and cuDNN.
- Automatic tensor-parallelism support with NCCL: distributed documentation
Quantization:
- Details
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with imatrix support
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit, with Marlin kernel support in 4-bit and 8-bit.
- HQQ: 4-bit and 8-bit, with ISQ support
- FP8
- BNB: bitsandbytes int8, fp4, nf4 support
Powerful:
- LoRA support with weight merging
- First X-LoRA inference platform with first class support
- AnyMoE: Build a memory-efficient MoE model from anything, in seconds
- Various sampling and penalty methods
- Native tool calling support for Llama, Mistral Small, Mistral Nemo, Hermes, and DeepSeek models: docs
- Prompt chunking: process large prompts in a more manageable way
Advanced features:
- PagedAttention and continuous batching (CUDA and Metal support)
- FlashAttention V2/V3
- Prefix caching
- Topology: Configure ISQ and device mapping easily
- UQFF: Quantized file format for easy mixing of quants, collection here.
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: examples and docs
- Integrated agentic web search capabilities, enabling models to easily access the internet.
Documentation for mistral.rs can be found here.
This is a demo of interactive mode with streaming, running Phi 3 128k Mini quantized via ISQ to Q4K.
https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
Note: See supported models for more information
Model | Supports quantization | Supports adapters | Supports device mapping | Supported by AnyMoE |
---|---|---|---|---|
Mistral v0.1/v0.2/v0.3 | ✅ | ✅ | ✅ | ✅ |
Gemma | ✅ | ✅ | ✅ | ✅ |
Llama 3.1/3.2 | ✅ | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | ✅ | |
Phi 2 | ✅ | ✅ | ✅ | ✅ |
Phi 3 | ✅ | ✅ | ✅ | ✅ |
Phi 3.5 MoE | ✅ | ✅ | ||
Qwen 2.5 | ✅ | ✅ | ✅ | |
Phi 3 Vision | ✅ | ✅ | ✅ | |
Idefics 2 | ✅ | ✅ | ✅ | |
Gemma 2 | ✅ | ✅ | ✅ | ✅ |
Starcoder 2 | ✅ | ✅ | ✅ | ✅ |
LLaVa Next | ✅ | ✅ | ✅ | |
LLaVa | ✅ | ✅ | ✅ | |
Llama 3.2 Vision | ✅ | ✅ | ||
Qwen2-VL | ✅ | ✅ | ||
Idefics 3 | ✅ | ✅ | ✅ | |
DeepseekV2 | ✅ | ✅ | ||
DeepseekV3 | ✅ | ✅ | ||
MiniCPM-O 2.6 | ✅ | ✅ ||
Phi 4 Multimodal | ✅ | ✅ | ||
Qwen2.5-VL | ✅ | ✅ | ||
Gemma 3 | ✅ | ✅ | ✅ | |
Mistral 3 | ✅ | ✅ | ✅ |
Rust multithreaded/async API for easy integration into any application.
- Docs
- Examples
- To install: add `mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }` to your Cargo.toml.
Python API for mistral.rs.
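Below is a minimal sketch of what using the Python API can look like. The `Runner`, `Which.Plain`, `Architecture`, and `ChatCompletionRequest` names follow the repository's Python examples and may differ between releases, so check the Python documentation for the exact signatures.

```python
# Hedged sketch of the mistralrs Python bindings; class and argument names
# (Runner, Which.Plain, Architecture, ChatCompletionRequest) follow the repo's
# examples and may vary by version.
from mistralrs import Architecture, ChatCompletionRequest, Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="microsoft/Phi-3.5-mini-instruct",  # any supported plain model
        arch=Architecture.Phi3,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",  # illustrative; the loaded model is used
        messages=[{"role": "user", "content": "Write a haiku about Rust."}],
        max_tokens=128,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```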
OpenAI API compatible API server
- CUDA:
  - Compile with the `cuda` feature: `--features cuda`
  - FlashAttention support: compile with the `flash-attn` feature
  - cuDNN support: compile with the `cudnn` feature: `--features cudnn`
- Metal:
  - Compile with the `metal` feature: `--features metal`
- CPU:
  - Intel MKL: compile with the `mkl` feature: `--features mkl`
  - Apple Accelerate: compile with the `accelerate` feature: `--features accelerate`
  - ARM NEON and AVX are used automatically
Enabling features is done by passing `--features ...` to the build system. When using `cargo run` or `maturin develop`, pass the `--features` flag before the `--` separating build flags from runtime flags.
- To enable a single feature like `metal`: `cargo build --release --features metal`
- To enable multiple features, specify them in quotes: `cargo build --release --features "cuda flash-attn cudnn"`
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
- Install the Python package here.
- The Python package has wheels on PyPi!
- Install required packages:
  - OpenSSL (example on Ubuntu: `sudo apt install libssl-dev`)
  - Linux only: `pkg-config` (example on Ubuntu: `sudo apt install pkg-config`)
- Install Rust: https://rustup.rs/
  Example on Ubuntu:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env
- Optional: set the HF token correctly (skip if it is already set, if your model is not gated, or if you want to use the `token_source` parameters in Python or on the command line):
  huggingface-cli login
  Note: you can install `huggingface-cli` as documented here.
- Download the code:
  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs
- Build or install:
  - Base build command: `cargo build --release`
  - Build with CUDA support: `cargo build --release --features cuda`
  - Build with CUDA and FlashAttention V2 support: `cargo build --release --features "cuda flash-attn"`
  - Build with Metal support: `cargo build --release --features metal`
  - Build with Accelerate support: `cargo build --release --features accelerate`
  - Build with MKL support: `cargo build --release --features mkl`
  - Install with `cargo install` for easy command line usage. Pass the same values to `--features` as you would for `cargo build`: `cargo install --path mistralrs-server --features cuda`
- The build process will output a binary, `mistralrs-server`, at `./target/release/mistralrs-server`. You can switch to that directory so that the binary can be accessed as `./mistralrs-server` with the following command (example on Ubuntu):
  cd target/release
- Use our APIs and integrations:
There are 2 ways to get models with mistral.rs:
- From Hugging Face Hub (easiest)
- From local files:
  - Running a GGUF model
  - Specifying local paths
Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. It may be one of:
- `literal:<value>`: Load from a specified literal
- `env:<value>`: Load from a specified environment variable
- `path:<value>`: Load from a specified file
- `cache` (default): Load from the HF token at ~/.cache/huggingface/token or equivalent
- `none`: Use no HF token
This is passed in the following ways:
- Command line:
./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
- Python: an example of setting the token source is sketched below.
If the token cannot be loaded, no token will be used (i.e., effectively `none`).
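For the Python case, the sketch below is hypothetical: the `token_source` keyword is an assumed mirror of the CLI's `--token-source` flag, so verify the real parameter name and accepted values in the Python documentation.

```python
# Hypothetical sketch: `token_source` is an assumed keyword mirroring the CLI's
# --token-source flag; the accepted string forms ("literal:", "env:", "path:",
# "cache", "none") are the same as listed above. Verify against the Python docs.
from mistralrs import Architecture, Runner, Which

runner = Runner(
    which=Which.Plain(
        model_id="meta-llama/Llama-3.2-3B-Instruct",  # a gated model needs a token
        arch=Architecture.Llama,
    ),
    token_source="env:HF_TOKEN",  # assumed parameter name
)
# ...then send requests with runner.send_chat_completion_request(...) as usual.
```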
You can also instruct mistral.rs to load models fully locally by modifying the *_model_id
arguments or options:
./mistralrs-server --port 1234 plain -m . -a mistral
Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:
- `--model-id` (server) or `model_id` (python/rust) or `--tok-model-id` (server) or `tok_model_id` (python/rust):
  - `config.json`
  - `tokenizer_config.json`
  - `tokenizer.json` (if not specified separately)
  - `.safetensors`/`.bin`/`.pth`/`.pt` files (defaults to `.safetensors`)
  - `preprocessor_config.json` (required for vision models)
  - `processor_config.json` (optional for vision models)
- `--quantized-model-id` (server) or `quantized_model_id` (python/rust):
  - Specified `.gguf` or `.ggml` file
- `--x-lora-model-id` (server) or `xlora_model_id` (python/rust):
  - `xlora_classifier.safetensors`
  - `xlora_config.json`
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories
- `--adapters-model-id` (server) or `adapters_model_id` (python/rust):
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories
To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
You must also specify either `-i` for interactive mode or `--port` to launch a server, just like when running a non-GGUF model with the CLI.
GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:
Running with a tokenizer model ID enables you to specify the model ID to source the tokenizer from:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf -t microsoft/Phi-3.5-mini-instruct
If the specified tokenizer model ID contains a `tokenizer.json`, then it will be used over the GGUF tokenizer.
Using the builtin tokenizer:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf
(or using a local file):
./mistralrs-server gguf -m path/to/files -f Phi-3.5-mini-instruct-Q4_K_M.gguf
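For reference, the same run can be expressed through the Python API. The sketch below assumes a `Which.GGUF` variant mirroring the CLI's `gguf` subcommand; its field names are taken from the repository's Python examples and may differ between versions, so treat them as assumptions.

```python
# Hedged sketch: run a GGUF model via the Python bindings, mirroring the CLI's
# `gguf` subcommand. Which.GGUF and its fields are assumptions based on the
# repo's Python examples and may differ between versions.
from mistralrs import ChatCompletionRequest, Runner, Which

runner = Runner(
    which=Which.GGUF(
        tok_model_id="microsoft/Phi-3.5-mini-instruct",  # tokenizer source (optional)
        quantized_model_id="bartowski/Phi-3.5-mini-instruct-GGUF",
        quantized_filename="Phi-3.5-mini-instruct-Q4_K_M.gguf",
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",  # illustrative; the loaded model is used
        messages=[{"role": "user", "content": "What is stored inside a GGUF file?"}],
        max_tokens=128,
    )
)
print(res.choices[0].message.content)
```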
There are a few more ways to configure:
Chat template:
The chat template can be automatically detected and loaded from the GGUF file if no other chat template source (including the tokenizer model ID) is specified.
If that does not work, you can either provide a tokenizer (recommended), or specify a custom chat template.
./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.gguf
Tokenizer
The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.
Supported GGUF tokenizer types
- `llama` (sentencepiece)
- `gpt2` (BPE)
Mistral.rs uses subcommands to control the model type. Please run `./mistralrs-server --help` to see the subcommands, which categorize the models by kind.
Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`). For quantized models (gguf/ggml), you may specify the data type as `f32` or `bf16` (`f16` is not recommended due to its lower precision in quantized inference).
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
- `mistral`
- `gemma`
- `mixtral`
- `llama`
- `phi2`
- `phi3`
- `phi3.5moe`
- `qwen2`
- `gemma2`
- `starcoder2`
- `deepseekv2`
- `deepseekv3`
Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device. This is specified in the `--dtype`/`-d` parameter after the model architecture (`vision-plain`).
- `phi3v`
- `idefics2`
- `llava_next`
- `llava`
- `vllama`
- `qwen2vl`
- `idefics3`
- `minicpmo`
- `phi4mm`
- `qwen2_5vl`
- `gemma3`
- `mistral3`
Plain:
- `llama`
- `phi2`
- `phi3`
- `starcoder2`
- `qwen2`
With adapters:
- `llama`
- `phi3`
You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Vision models work too:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
And even diffusion models:
./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
On Apple Silicon (`Metal`), run with a throughput log, PagedAttention settings (a maximum of 4 GB for the KV cache), and dtype (bf16 for the KV cache and attention):
cargo build --release --features metal
./target/release/mistralrs-server -i --throughput --paged-attn --pa-gpu-mem 4096 gguf --dtype bf16 -m /Users/Downloads/ -f Phi-3.5-mini-instruct-Q4_K_M.gguf
You can launch an HTTP server:
./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe
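Once a server is running, any OpenAI-compatible client can talk to it. The sketch below uses the official `openai` Python client against the assumed default `/v1` routes on port 1234; the placeholder API key and the model name are illustrative assumptions.

```python
# Hedged sketch: query the HTTP server started above. Assumes the default
# OpenAI-compatible /v1 routes on port 1234 and no API key requirement.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="default",  # illustrative; the server serves whichever model it loaded
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain mixture-of-experts in two sentences."},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```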
We provide a method to select models with a `.toml` file. The keys are the same as the command line, with `no_kv_cache` and `tokenizer_json` being "global" keys.
Example:
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
Device | Mistral.rs Completion T/s | Llama.cpp Completion T/s | Model | Quant |
---|---|---|---|---|
A10 GPU, CUDA | 86 | 83 | mistral-7b | 4_K_M |
Intel Xeon 8358 CPU, AVX | 11 | 23 | mistral-7b | 4_K_M |
Raspberry Pi 5 (8GB), Neon | 2 | 3 | mistral-7b | 2_K |
A100 GPU, CUDA | 131 | 134 | mistral-7b | 4_K_M |
RTX 6000 GPU, CUDA | 103 | 96 | mistral-7b | 4_K_M |
Note: All CUDA tests for mistral.rs conducted with PagedAttention enabled, block size = 32
Please submit more benchmarks by raising an issue!
Quantization support
Model | GGUF | GGML | ISQ |
---|---|---|---|
Mistral | ✅ | ✅ | |
Gemma | ✅ | ||
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | |
Phi 2 | ✅ | ✅ | |
Phi 3 | ✅ | ✅ | |
Phi 3.5 MoE | ✅ | ||
Qwen 2.5 | ✅ | ||
Phi 3 Vision | ✅ | ||
Idefics 2 | ✅ | ||
Gemma 2 | ✅ | ||
Starcoder 2 | ✅ | ✅ | |
LLaVa Next | ✅ | ||
LLaVa | ✅ | ||
Llama 3.2 Vision | ✅ | ||
Qwen2-VL | ✅ | ||
Idefics 3 | ✅ | ||
Deepseek V2 | ✅ | ||
Deepseek V3 | ✅ | ||
MiniCPM-O 2.6 | ✅ | ||
Qwen2.5-VL | ✅ | ||
Gemma 3 | ✅ | ||
Mistral 3 | ✅ |
Device mapping support
Model category | Supported |
---|---|
Plain | ✅ |
GGUF | ✅ |
GGML | |
Vision Plain | ✅ |
X-LoRA and LoRA support
Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
---|---|---|---|
Mistral | ✅ | ✅ | |
Gemma | ✅ | ||
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | |
Phi 2 | ✅ | ||
Phi 3 | ✅ | ✅ | |
Phi 3.5 MoE | |||
Qwen 2.5 | |||
Phi 3 Vision | |||
Idefics 2 | |||
Gemma 2 | ✅ | ||
Starcoder 2 | ✅ | ||
LLaVa Next | |||
LLaVa | |||
Qwen2-VL | |||
Idefics 3 | |||
Deepseek V2 | |||
Deepseek V3 | |||
MiniCPM-O 2.6 | |||
Qwen2.5-VL | |||
Gemma 3 | |||
Mistral 3 |
AnyMoE support
Model | AnyMoE |
---|---|
Mistral 7B | ✅ |
Gemma | ✅ |
Llama | ✅ |
Mixtral | |
Phi 2 | ✅ |
Phi 3 | ✅ |
Phi 3.5 MoE | |
Qwen 2.5 | ✅ |
Phi 3 Vision | |
Idefics 2 | |
Gemma 2 | ✅ |
Starcoder 2 | ✅ |
LLaVa Next | ✅ |
LLaVa | ✅ |
Llama 3.2 Vision | |
Qwen2-VL | |
Idefics 3 | ✅ |
Deepseek V2 | |
Deepseek V3 | |
MiniCPM-O 2.6 | |
Qwen2.5-VL | |
Gemma 3 | ✅ |
Mistral 3 | ✅ |
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. For example, when using a different model than the default, specify the following for each model type:
- Plain: Model id
- Quantized: Quantized model id, quantized filename, and tokenizer id
- X-LoRA: Model id, X-LoRA ordering
- X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- LoRA: Model id, LoRA ordering
- LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
- Vision Plain: Model id
See this section to determine whether it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.
It is also important to check the chat template style of the model. If the HF Hub repo has a `tokenizer_config.json` file, it is not necessary to specify one. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models here. Examples may be found here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
- Debugging with the environment variable `MISTRALRS_DEBUG=1` causes the following:
  - If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor:
    - `mistralrs_gguf_tensors.txt` or `mistralrs_ggml_tensors.txt`
  - More logging.
- Setting the CUDA compiler path:
  - Set the `NVCC_CCBIN` environment variable during the build.
- Error: `recompile with -fPIE`:
  - Some Linux distributions require compiling with `-fPIE`.
  - Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during the build: `CUDA_NVCC_FLAGS=-fPIE`
- Error `CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vision model:
  - For non-quantized models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16` or `auto` to choose based on the device.
- What is the minimum supported CUDA compute cap?
  - The minimum CUDA compute cap is 5.3.
This project would not be possible without the excellent work at `candle`. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.

mistral.rs
Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.

rwkv.cpp
rwkv.cpp is a port of BlinkDL/RWKV-LM to ggerganov/ggml, supporting FP32, FP16, and quantized INT4, INT5, and INT8 inference. It focuses on CPU but also supports cuBLAS. The project provides a C library rwkv.h and a Python wrapper. RWKV is a large language model architecture with models like RWKV v5 and v6. It requires only state from the previous step for calculations, making it CPU-friendly on large context lengths. Users are advised to test all available formats for perplexity and latency on a representative dataset before serious use.

aicommit2
AICommit2 is a Reactive CLI tool that streamlines interactions with various AI providers such as OpenAI, Anthropic Claude, Gemini, Mistral AI, Cohere, and unofficial providers like Huggingface and Clova X. Users can request multiple AI simultaneously to generate git commit messages without waiting for all AI responses. The tool runs 'git diff' to grab code changes, sends them to configured AI, and returns the AI-generated commit message. Users can set API keys or Cookies for different providers and configure options like locale, generate number of messages, commit type, proxy, timeout, max-length, and more. AICommit2 can be used both locally with Ollama and remotely with supported providers, offering flexibility and efficiency in generating commit messages.

foul-play
Foul Play is a Pokémon battle-bot that can play single battles in all generations on Pokemon Showdown. It requires Python 3.10+. The bot uses environment variables for configuration and supports different game modes and battle strategies. Users can specify teams and choose between algorithms like Monte-Carlo Tree Search and Expectiminimax. Foul Play can be run locally or with Docker, and the engine used for battles must be built from source. The tool provides flexibility in gameplay and strategy for Pokémon battles.

TPI-LLM
TPI-LLM (Tensor Parallelism Inference for Large Language Models) is a system designed to bring LLM functions to low-resource edge devices, addressing privacy concerns by enabling LLM inference on edge devices with limited resources. It leverages multiple edge devices for inference through tensor parallelism and a sliding window memory scheduler to minimize memory usage. TPI-LLM demonstrates significant improvements in TTFT and token latency compared to other models, and plans to support infinitely large models with low token latency in the future.

cortex.cpp
Cortex.cpp is an open-source platform designed as the brain for robots, offering functionalities such as vision, speech, language, tabular data processing, and action. It provides an AI platform for running AI models with multi-engine support, hardware optimization with automatic GPU detection, and an OpenAI-compatible API. Users can download models from the Hugging Face model hub, run models, manage resources, and access advanced features like multiple quantizations and engine management. The tool is under active development, promising rapid improvements for users.

local-deep-research
Local Deep Research is a powerful AI-powered research assistant that performs deep, iterative analysis using multiple LLMs and web searches. It can be run locally for privacy or configured to use cloud-based LLMs for enhanced capabilities. The tool offers advanced research capabilities, flexible LLM support, rich output options, privacy-focused operation, enhanced search integration, and academic & scientific integration. It also provides a web interface, command line interface, and supports multiple LLM providers and search engines. Users can configure AI models, search engines, and research parameters for customized research experiences.

distributed-llama
Distributed Llama is a tool that allows you to run large language models (LLMs) on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. It uses TCP sockets to synchronize the state of the neural network, and you can easily configure your AI cluster by using a home router. Distributed Llama supports models such as Llama 2 (7B, 13B, 70B) chat and non-chat versions, Llama 3, and Grok-1 (314B).

gollama
Gollama is a delightful tool that brings Ollama, your offline conversational AI companion, directly into your terminal. It provides a fun and interactive way to generate responses from various models without needing internet connectivity. Whether you're brainstorming ideas, exploring creative writing, or just looking for inspiration, Gollama is here to assist you. The tool offers an interactive interface, customizable prompts, multiple models selection, and visual feedback to enhance user experience. It can be installed via different methods like downloading the latest release, using Go, running with Docker, or building from source. Users can interact with Gollama through various options like specifying a custom base URL, prompt, model, and enabling raw output mode. The tool supports different modes like interactive, piped, CLI with image, and TUI with image. Gollama relies on third-party packages like bubbletea, glamour, huh, and lipgloss. The roadmap includes implementing piped mode, support for extracting codeblocks, copying responses/codeblocks to clipboard, GitHub Actions for automated releases, and downloading models directly from Ollama using the rest API. Contributions are welcome, and the project is licensed under the MIT License.

evalchemy
Evalchemy is a unified and easy-to-use toolkit for evaluating language models, focusing on post-trained models. It integrates multiple existing benchmarks such as RepoBench, AlpacaEval, and ZeroEval. Key features include unified installation, parallel evaluation, simplified usage, and results management. Users can run various benchmarks with a consistent command-line interface and track results locally or integrate with a database for systematic tracking and leaderboard submission.

graphrag-visualizer
GraphRAG Visualizer is an application designed to visualize Microsoft GraphRAG artifacts by uploading parquet files generated from the GraphRAG indexing pipeline. Users can view and analyze data in 2D or 3D graphs, display data tables, search for specific nodes or relationships, and process artifacts locally for data security and privacy.

ovos-installer
The ovos-installer is a simple and multilingual tool designed to install Open Voice OS and HiveMind using Bash, Whiptail, and Ansible. It supports various Linux distributions and provides an automated installation process. Users can easily start and stop services, update their Open Voice OS instance, and uninstall the tool if needed. The installer also allows for non-interactive installation through scenario files. It offers a user-friendly way to set up Open Voice OS on different systems.

rpaframework
RPA Framework is an open-source collection of libraries and tools for Robotic Process Automation (RPA), designed to be used with Robot Framework and Python. It offers well-documented core libraries for Software Robot Developers, optimized for Robocorp Control Room and Developer Tools, and accepts external contributions. The project includes various libraries for tasks like archiving, browser automation, date/time manipulations, cloud services integration, encryption operations, database interactions, desktop automation, document processing, email operations, Excel manipulation, file system operations, FTP interactions, web API interactions, image manipulation, AI services, and more. The development of the repository is Python-based and requires Python version 3.8+, with tooling based on poetry and invoke for compiling, building, and running the package. The project is licensed under the Apache License 2.0.

StableToolBench
StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features such as a Virtual API System with caching and API simulators, a new set of solvable queries determined by LLMs, and a Stable Evaluation System using GPT-4. The Virtual API Server can be set up either by building from source or using a prebuilt Docker image. Users can test the server using provided scripts and evaluate models with Solvable Pass Rate and Solvable Win Rate metrics. The tool also includes model experiments results comparing different models' performance.

Large-Language-Models-play-StarCraftII
Large Language Models Play StarCraft II is a project that explores the capabilities of large language models (LLMs) in playing the game StarCraft II. The project introduces TextStarCraft II, a textual environment for the game, and a Chain of Summarization method for analyzing game information and making strategic decisions. Through experiments, the project demonstrates that LLM agents can defeat the built-in AI at a challenging difficulty level. The project provides benchmarks and a summarization approach to enhance strategic planning and interpretability in StarCraft II gameplay.

paperless-gpt
paperless-gpt is a tool designed to generate accurate and meaningful document titles and tags for paperless-ngx using Large Language Models (LLMs). It supports multiple LLM providers, including OpenAI and Ollama. With paperless-gpt, you can streamline your document management by automatically suggesting appropriate titles and tags based on the content of your scanned documents. The tool offers features like multiple LLM support, customizable prompts, easy integration with paperless-ngx, user-friendly interface for reviewing and applying suggestions, dockerized deployment, automatic document processing, and an experimental OCR feature.
For similar tasks

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.

jupyter-ai
Jupyter AI connects generative AI with Jupyter notebooks. It provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook. Specifically, Jupyter AI offers: * An `%%ai` magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, Kaggle, VSCode, etc.). * A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant. * Support for a wide range of generative model providers, including AI21, Anthropic, AWS, Cohere, Gemini, Hugging Face, NVIDIA, and OpenAI. * Local model support through GPT4All, enabling use of generative AI models on consumer grade machines with ease and privacy.

khoj
Khoj is an open-source, personal AI assistant that extends your capabilities by creating always-available AI agents. You can share your notes and documents to extend your digital brain, and your AI agents have access to the internet, allowing you to incorporate real-time information. Khoj is accessible on Desktop, Emacs, Obsidian, Web, and Whatsapp, and you can share PDF, markdown, org-mode, notion files, and GitHub repositories. You'll get fast, accurate semantic search on top of your docs, and your agents can create deeply personal images and understand your speech. Khoj is self-hostable and always will be.

langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).

danswer
Danswer is an open-source Gen-AI Chat and Unified Search tool that connects to your company's docs, apps, and people. It provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. Since you own the deployment, your user data and chats are fully in your own control. Danswer is MIT licensed and designed to be modular and easily extensible. The system also comes fully ready for production usage with user authentication, role management (admin/basic users), chat persistence, and a UI for configuring Personas (AI Assistants) and their Prompts. Danswer also serves as a Unified Search across all common workplace tools such as Slack, Google Drive, Confluence, etc. By combining LLMs and team specific knowledge, Danswer becomes a subject matter expert for the team. Imagine ChatGPT if it had access to your team's unique knowledge! It enables questions such as "A customer wants feature X, is this already supported?" or "Where's the pull request for feature Y?"

infinity
Infinity is an AI-native database designed for LLM applications, providing incredibly fast full-text and vector search capabilities. It supports a wide range of data types, including vectors, full-text, and structured data, and offers a fused search feature that combines multiple embeddings and full text. Infinity is easy to use, with an intuitive Python API and a single-binary architecture that simplifies deployment. It achieves high performance, with 0.1 milliseconds query latency on million-scale vector datasets and up to 15K QPS.
For similar jobs

h2ogpt
h2oGPT is an Apache V2 open-source project that allows users to query and summarize documents or chat with local private GPT LLMs. It features a private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.), a persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.), and efficient use of context using instruct-tuned LLMs (no need for LangChain's few-shot approach). h2oGPT also offers parallel summarization and extraction, reaching an output of 80 tokens per second with the 13B LLaMa2 model, HYDE (Hypothetical Document Embeddings) for enhanced retrieval based upon LLM responses, a variety of models supported (LLaMa2, Mistral, Falcon, Vicuna, WizardLM. With AutoGPTQ, 4-bit/8-bit, LORA, etc.), GPU support from HF and LLaMa.cpp GGML models, and CPU support using HF, LLaMa.cpp, and GPT4ALL models. Additionally, h2oGPT provides Attention Sinks for arbitrarily long generation (LLaMa-2, Mistral, MPT, Pythia, Falcon, etc.), a UI or CLI with streaming of all models, the ability to upload and view documents through the UI (control multiple collaborative or personal collections), Vision Models LLaVa, Claude-3, Gemini-Pro-Vision, GPT-4-Vision, Image Generation Stable Diffusion (sdxl-turbo, sdxl) and PlaygroundAI (playv2), Voice STT using Whisper with streaming audio conversion, Voice TTS using MIT-Licensed Microsoft Speech T5 with multiple voices and Streaming audio conversion, Voice TTS using MPL2-Licensed TTS including Voice Cloning and Streaming audio conversion, AI Assistant Voice Control Mode for hands-free control of h2oGPT chat, Bake-off UI mode against many models at the same time, Easy Download of model artifacts and control over models like LLaMa.cpp through the UI, Authentication in the UI by user/password via Native or Google OAuth, State Preservation in the UI by user/password, Linux, Docker, macOS, and Windows support, Easy Windows Installer for Windows 10 64-bit (CPU/CUDA), Easy macOS Installer for macOS (CPU/M1/M2), Inference Servers support (oLLaMa, HF TGI server, vLLM, Gradio, ExLLaMa, Replicate, OpenAI, Azure OpenAI, Anthropic), OpenAI-compliant, Server Proxy API (h2oGPT acts as drop-in-replacement to OpenAI server), Python client API (to talk to Gradio server), JSON Mode with any model via code block extraction. Also supports MistralAI JSON mode, Claude-3 via function calling with strict Schema, OpenAI via JSON mode, and vLLM via guided_json with strict Schema, Web-Search integration with Chat and Document Q/A, Agents for Search, Document Q/A, Python Code, CSV frames (Experimental, best with OpenAI currently), Evaluate performance using reward models, and Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours.

mistral.rs
Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.

ollama
Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. Ollama is designed to be easy to use and accessible to developers of all levels. It is open source and available for free on GitHub.

llama-cpp-agent
The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM models, execute structured function calls and get structured output (objects). It provides a simple yet robust interface and supports llama-cpp-python and OpenAI endpoints with GBNF grammar support (like the llama-cpp-python server) and the llama.cpp backend server. It works by generating a formal GGML-BNF grammar of the user defined structures and functions, which is then used by llama.cpp to generate text valid to that grammar. In contrast to most GBNF grammar generators it also supports nested objects, dictionaries, enums and lists of them.

llama_ros
This repository provides a set of ROS 2 packages to integrate llama.cpp into ROS 2. By using the llama_ros packages, you can easily incorporate the powerful optimization capabilities of llama.cpp into your ROS 2 projects by running GGUF-based LLMs and VLMs.

MITSUHA
OneReality is a virtual waifu/assistant that you can speak to through your mic and it'll speak back to you! It has many features such as: * You can speak to her with a mic * It can speak back to you * Has short-term memory and long-term memory * Can open apps * Smarter than you * Fluent in English, Japanese, Korean, and Chinese * Can control your smart home like Alexa if you set up Tuya (more info in Prerequisites) It is built with Python, Llama-cpp-python, Whisper, SpeechRecognition, PocketSphinx, VITS-fast-fine-tuning, VITS-simple-api, HyperDB, Sentence Transformers, and Tuya Cloud IoT.

wenxin-starter
WenXin-Starter is a spring-boot-starter for Baidu's "Wenxin Qianfan WENXINWORKSHOP" large model, which can help you quickly access Baidu's AI capabilities. It fully integrates the official API documentation of Wenxin Qianfan. Supports text-to-image generation, built-in dialogue memory, and supports streaming return of dialogue. Supports QPS control of a single model and supports queuing mechanism. Plugins will be added soon.

FlexFlow
FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.