mistral.rs

Blazingly fast LLM inference.

Stars: 5396

Visit

Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.

README:

mistral.rs

Blazingly fast LLM inference.

Please submit requests for new models here.

Get started fast 🚀

Install
Get models
Deploy with our easy to use APIs

Quick examples

After following installation instructions

Check out UQFF for prequantized models of various methods!
- Models can be found here.
🔍🌐 Easily add web search capabilities to your models! Compatible with OpenAI's web_search_options parameter: documentation
```
./mistralrs-server --enable-search --port 1234 --isq q4k plain -m NousResearch/Hermes-3-Llama-3.1-8B
```
💎💎💎 Run the entire Gemma 3 Model family (1b, 4b, 12b, 27b) with 128k context length and vision support: documentation
```
./mistralrs-server -i vision-plain -m google/gemma-3-4b-it -a gemma3
```

Run the Mistral 3 Model with 128k context length and strong vision support: documentation

./mistralrs-server -i --isq q4k vision-plain -m mistralai/Mistral-Small-3.1-24B-Instruct-2503 -a mistral3

🐋🐋🐋 Run the Deepseek R1/V3 model with automatic tensor parallelism: documentation
```
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1
```

🐋🐋🐋 Run the Deepseek R1 distillations out of the box

./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
./mistralrs-server -i --isq Q4K plain -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

🦙📷 Run the Llama 3.2 Vision Model: documentation and guide here

Credit

./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama

φ⁴ 📷 Run the Phi 4 Multimodal model: documentation and guide here

./mistralrs-server -i vision-plain -m microsoft/Phi-4-multimodal-instruct -a phi4mm

φ⁴ Run the new Phi 4/Phi 4 Mini models with 128K context window

./mistralrs-server -i plain -m microsoft/Phi-4-mini-instruct -a phi3

🧮 Enhance ISQ by collecting an imatrix from calibration data: documentation

./mistralrs-server -i --isq Q4K plain -m meta-llama/Llama-3.2-3B-Instruct --calibration-file calibration_data/calibration_datav3_small.txt

🌲📷 Run the FLUX.1 diffusion model: documentation and guide here

./mistralrs-server --port 1234 diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux

Other models: see a support matrix and how to run them

Mistral.rs supports several model categories:

Text to Text
Text+Image to Text: Vision (see the docs)
Text to Image: Image Generation (see the docs)

Description

Easy:

Lightweight OpenAI API compatible HTTP server
Python API
Grammar support with JSON Schema, Regex, Lark, and Guidance via LLGuidance library
ISQ (In situ quantization): run .safetensors models directly from 🤗 Hugging Face by quantizing in-place
- Enhance performance with an imatrix!
Automatic device mapping to easily load and run models across multiple GPUs and CPU.
Specify custom chat templates easily: chat templates

Fast:

Apple silicon support: ARM NEON, Accelerate, Metal
Accelerated CPU inference with MKL, AVX support
CUDA support with FlashAttention and cuDNN.
Automatic tensor-parallelism support with NCCL: distributed documentation

Quantization:

Details
GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with imatrix support
GPTQ: 2-bit, 3-bit, 4-bit and 8-bit, with Marlin kernel support in 4-bit and 8-bit.
HQQ: 4-bit and 8 bit, with ISQ support
FP8
BNB: bitsandbytes int8, fp4, nf4 support

Powerful:

LoRA support with weight merging
First X-LoRA inference platform with first class support
AnyMoE: Build a memory-efficient MoE model from anything, in seconds
Various sampling and penalty methods
Native tool calling support for Llama, Mistral Small, Mistral Nemo, Hermes, and DeepSeek models: docs
Prompt chunking: process large prompts in a more manageable way

Advanced features:

PagedAttention and continuous batching (CUDA and Metal support)
FlashAttention V2/V3
Prefix caching
Topology: Configure ISQ and device mapping easily
UQFF: Quantized file format for easy mixing of quants, collection here.
Speculative Decoding: Mix supported models as the draft model or the target model
Dynamic LoRA adapter activation with adapter preloading: examples and docs
Integrated agentic web search capabilities, enabling models to easily access the internet.

Documentation for mistral.rs can be found here.

This is a demo of interactive mode with streaming running Phi 3 128k mini with quantization via ISQ to Q4K.

https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c

Architecture Support matrix

Note: See supported models for more information

Model	Supports quantization	Supports adapters	Supports device mapping	Supported by AnyMoE
Mistral v0.1/v0.2/v0.3	✅	✅	✅	✅
Gemma	✅	✅	✅	✅
Llama 3.1/3.2	✅	✅	✅	✅
Mixtral	✅	✅	✅
Phi 2	✅	✅	✅	✅
Phi 3	✅	✅	✅	✅
Phi 3.5 MoE	✅		✅
Qwen 2.5	✅		✅	✅
Phi 3 Vision	✅		✅	✅
Idefics 2	✅		✅	✅
Gemma 2	✅	✅	✅	✅
Starcoder 2	✅	✅	✅	✅
LLaVa Next	✅		✅	✅
LLaVa	✅		✅	✅
Llama 3.2 Vision	✅		✅
Qwen2-VL	✅		✅
Idefics 3	✅		✅	✅
DeepseekV2	✅		✅
DeepseekV3	✅		✅
MinCPM-O 2.6	✅		✅
Phi 4 Multimodal	✅		✅
Qwen2.5-VL	✅		✅
Gemma 3	✅		✅	✅
Mistral 3	✅		✅	✅

APIs and Integrations

Rust Crate

Rust multithreaded/async API for easy integration into any application.

Docs
Examples
To install: Add mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }

Python API

Python API for mistral.rs.

HTTP Server

OpenAI API compatible API server

Llama Index integration (Python)

Docs: https://docs.llamaindex.ai/en/stable/examples/llm/mistral_rs/

Supported accelerators

CUDA:
- Compile with the cuda feature: --features cuda
- FlashAttention support: compile with the flash-attn feature
- cuDNN support: compile with thecudnn feature: --features cudnn
Metal:
- Compile with the metal feature: --features metal
CPU:
- Intel MKL: compile with the mkl feature: --features mkl
- Apple Accelerate: compile with the accelerate feature: --features accelerate
- ARM NEON and AVX are used automatically

Enabling features is done by passing --features ... to the build system. When using cargo run or maturin develop, pass the --features flag before the -- separating build flags from runtime flags.

To enable a single feature like metal: cargo build --release --features metal.
To enable multiple features, specify them in quotes: cargo build --release --features "cuda flash-attn cudnn".

Installation and Build

Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/

Install the Python package here.
The Python package has wheels on PyPi!

Install required packages:
- OpenSSL (Example on Ubuntu: sudo apt install libssl-dev)
- Linux only: pkg-config (Example on Ubuntu: sudo apt install pkg-config)

Install Rust: https://rustup.rs/

Example on Ubuntu:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Optional: Set HF token correctly (skip if already set or your model is not gated, or if you want to use the token_source parameters in Python or the command line.)
- Note: you can install huggingface-cli as documented here.
```
huggingface-cli login
```

Download the code:

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Build or install:
- Base build command
```
cargo build --release
```
- Build with CUDA support
```
cargo build --release --features cuda
```
- Build with CUDA and Flash Attention V2 support
```
cargo build --release --features "cuda flash-attn"
```
- Build with Metal support
```
cargo build --release --features metal
```
- Build with Accelerate support
```
cargo build --release --features accelerate
```
- Build with MKL support
```
cargo build --release --features mkl
```
- Install with cargo install for easy command line usage
  
  Pass the same values to --features as you would for cargo build
```
cargo install --path mistralrs-server --features cuda
```
The build process will output a binary mistralrs-server at ./target/release/mistralrs-server. We can switch to that directory so that the binary can be accessed as ./mistralrs-server with the following command:

Example on Ubuntu:
```
cd target/release
```
Use our APIs and integrations:

APIs and integrations list

Getting models

There are 2 ways to get models with mistral.rs:

From Hugging Face Hub (easiest)
From local files
- Running a GGUF model
- Specify local paths

Getting models from Hugging Face Hub

Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. They may be one of:

literal:<value>: Load from a specified literal
env:<value>: Load from a specified environment variable
path:<value>: Load from a specified file
cache: default: Load from the HF token at ~/.cache/huggingface/token or equivalent.
none: Use no HF token

This is passed in the following ways:

Command line:

./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Python:

Here is an example of setting the token source.

If token cannot be loaded, no token will be used (i.e. effectively using none).

Loading models from local files:

You can also instruct mistral.rs to load models fully locally by modifying the *_model_id arguments or options:

./mistralrs-server --port 1234 plain -m . -a mistral

Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:

--model-id (server) or model_id (python/rust) or --tok-model-id (server) or tok_model_id (python/rust):
- config.json
- tokenizer_config.json
- tokenizer.json (if not specified separately)
- .safetensors/.bin/.pth/.pt files (defaults to .safetensors)
- preprocessor_config.json (required for vision models).
- processor_config.json (optional for vision models).
--quantized-model-id (server) or quantized_model_id (python/rust):
- Specified .gguf or .ggml file.
--x-lora-model-id (server) or xlora_model_id (python/rust):
- xlora_classifier.safetensors
- xlora_config.json
- Adapters .safetensors and adapter_config.json files in their respective directories
--adapters-model-id (server) or adapters_model_id (python/rust):
- Adapters .safetensors and adapter_config.json files in their respective directories

Running GGUF models

To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.

You must also specify either -i for interactive mode or --port to launch a server, just like when running a non-GGUF model with the CLI

GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:

With a specified tokenizer
With the builtin tokenizer

With a specified tokenizer

Running with a tokenizer model ID enables you to specify the model ID to source the tokenizer from:

./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf -t microsoft/Phi-3.5-mini-instruct

If the specified tokenizer model ID contains a tokenizer.json, then it will be used over the GGUF tokenizer.

With the builtin tokenizer

Using the builtin tokenizer:

./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf

(or using a local file):

./mistralrs-server gguf -m path/to/files -f Phi-3.5-mini-instruct-Q4_K_M.gguf

There are a few more ways to configure:

Chat template:

The chat template can be automatically detected and loaded from the GGUF file if no other chat template source is specified including the tokenizer model ID.

If that does not work, you can either provide a tokenizer (recommended), or specify a custom chat template.

./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.gguf

Tokenizer

The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in examples below, where the tokenizer is sourced from Hugging Face.

Supported GGUF tokenizer types

llama (sentencepiece)
gpt2 (BPE)

Run with the CLI

Mistral.rs uses subcommands to control the model type. Please run ./mistralrs-server --help to see the subcommands which categorize the models by kind.

Architecture for plain models

Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device. This is specified in the --dype/-d parameter after the model architecture (plain). For quantized models (gguf/ggml), you may specify data type of f32 or bf16 (f16 is not recommended due to its lower precision in quantized inference).

If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.

mistral
gemma
mixtral
llama
phi2
phi3
phi3.5moe
qwen2
gemma2
starcoder2
deepseekv2
deepseekv3

Architecture for vision models

Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device. This is specified in the --dype/-d parameter after the model architecture (vision-plain).

phi3v
idefics2
llava_next
llava
vllama
qwen2vl
idefics3
minicpmo
phi4mm
qwen2_5vl
gemma3
mistral3

Supported GGUF architectures

Plain:

llama
phi2
phi3
starcoder2
qwen2

With adapters:

llama
phi3

Interactive mode

You can launch interactive mode, a simple chat application running in the terminal, by passing -i:

./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Vision models work too:

./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama

And even diffusion models:

./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux

On Apple Silicon (Metal), run with throughput log, settings of paged attention (maximum usage of 4GB for kv cache) and dtype (bf16 for kv cache and attention)

cargo build --release --features metal
./target/release/mistralrs-server -i --throughput --paged-attn --pa-gpu-mem 4096 gguf --dtype bf16 -m /Users/Downloads/ -f Phi-3.5-mini-instruct-Q4_K_M.gguf

OpenAI HTTP server

You can an HTTP server

./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe

Structured selection with a `.toml` file

We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.

Example:

./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml

Benchmarks

Device	Mistral.rs Completion T/s	Llama.cpp Completion T/s	Model	Quant
A10 GPU, CUDA	86	83	mistral-7b	4_K_M
Intel Xeon 8358 CPU, AVX	11	23	mistral-7b	4_K_M
Raspberry Pi 5 (8GB), Neon	2	3	mistral-7b	2_K
A100 GPU, CUDA	131	134	mistral-7b	4_K_M
RTX 6000 GPU, CUDA	103	96	mistral-7b	4_K_M

Note: All CUDA tests for mistral.rs conducted with PagedAttention enabled, block size = 32

Please submit more benchmarks via raising an issue!

Supported models

Quantization support

Model	GGUF	GGML	ISQ
Mistral	✅		✅
Gemma			✅
Llama	✅	✅	✅
Mixtral	✅		✅
Phi 2	✅		✅
Phi 3	✅		✅
Phi 3.5 MoE			✅
Qwen 2.5			✅
Phi 3 Vision			✅
Idefics 2			✅
Gemma 2			✅
Starcoder 2		✅	✅
LLaVa Next			✅
LLaVa			✅
Llama 3.2 Vision			✅
Qwen2-VL			✅
Idefics 3			✅
Deepseek V2			✅
Deepseek V3			✅
MiniCPM-O 2.6			✅
Qwen2.5-VL			✅
Gemma 3			✅
Mistral 3			✅

Device mapping support

Model category	Supported
Plain	✅
GGUF	✅
GGML
Vision Plain	✅

X-LoRA and LoRA support

Model	X-LoRA	X-LoRA+GGUF	X-LoRA+GGML
Mistral	✅	✅
Gemma	✅
Llama	✅	✅	✅
Mixtral	✅	✅
Phi 2	✅
Phi 3	✅	✅
Phi 3.5 MoE
Qwen 2.5
Phi 3 Vision
Idefics 2
Gemma 2	✅
Starcoder 2	✅
LLaVa Next
LLaVa
Qwen2-VL
Idefics 3
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3
Mistral 3

AnyMoE support

Model	AnyMoE
Mistral 7B	✅
Gemma	✅
Llama	✅
Mixtral
Phi 2	✅
Phi 3	✅
Phi 3.5 MoE
Qwen 2.5	✅
Phi 3 Vision
Idefics 2
Gemma 2	✅
Starcoder 2	✅
LLaVa Next	✅
LLaVa	✅
Llama 3.2 Vision
Qwen2-VL
Idefics 3	✅
Deepseek V2
Deepseek V3
MiniCPM-O 2.6
Qwen2.5-VL
Gemma 3	✅
Mistral 3	✅

Using derivative model

To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass --help after the subcommand. For example, when using a different model than the default, specify the following for the following types of models:

Plain: Model id
Quantized: Quantized model id, quantized filename, and tokenizer id
X-LoRA: Model id, X-LoRA ordering
X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
LoRA: Model id, LoRA ordering
LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
Vision Plain: Model id

See this section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file, it is always necessary if the target modules or architecture changed, or if the adapter order changed.

It is also important to check the chat template style of the model. If the HF hub repo has a tokenizer_config.json file, it is not necessary to specify. Otherwise, templates can be found in chat_templates and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, no messages.

For example, when using a Zephyr model:

./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf

Adapter model support: X-LoRA and LoRA

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here. Examples may be found here.

Chat Templates and Tokenizer

Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.

Contributing

Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.

FAQ

Debugging with the environment variable MISTRALRS_DEBUG=1 causes the following things
- If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
  - mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
- More logging.
Setting the CUDA compiler path:
- Set the NVCC_CCBIN environment variable during build.
Error: recompile with -fPIE:
- Some Linux distributions require compiling with -fPIE.
- Set the CUDA_NVCC_FLAGS environment variable to -fPIE during build: CUDA_NVCC_FLAGS=-fPIE
Error CUDA_ERROR_NOT_FOUND or symbol not found when using a normal or vison model:
- For non-quantized models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device.
What is the minimum supported CUDA compute cap?
- The minimum CUDA compute cap is 5.3.

Credits

This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.

⬆️ Back to Top

For Tasks:

Click tags to check more tools for each tasks

chat summarize translate generate code answer questions

For Jobs:

chatbot summarization translation code generation question answering

Alternative AI tools for mistral.rs

Similar Open Source Tools

mistral.rs

github

: 5.4k

rwkv.cpp

rwkv.cpp is a port of BlinkDL/RWKV-LM to ggerganov/ggml, supporting FP32, FP16, and quantized INT4, INT5, and INT8 inference. It focuses on CPU but also supports cuBLAS. The project provides a C library rwkv.h and a Python wrapper. RWKV is a large language model architecture with models like RWKV v5 and v6. It requires only state from the previous step for calculations, making it CPU-friendly on large context lengths. Users are advised to test all available formats for perplexity and latency on a representative dataset before serious use.

github

: 1.1k

aicommit2

AICommit2 is a Reactive CLI tool that streamlines interactions with various AI providers such as OpenAI, Anthropic Claude, Gemini, Mistral AI, Cohere, and unofficial providers like Huggingface and Clova X. Users can request multiple AI simultaneously to generate git commit messages without waiting for all AI responses. The tool runs 'git diff' to grab code changes, sends them to configured AI, and returns the AI-generated commit message. Users can set API keys or Cookies for different providers and configure options like locale, generate number of messages, commit type, proxy, timeout, max-length, and more. AICommit2 can be used both locally with Ollama and remotely with supported providers, offering flexibility and efficiency in generating commit messages.

github

: 242

foul-play

Foul Play is a Pokémon battle-bot that can play single battles in all generations on Pokemon Showdown. It requires Python 3.10+. The bot uses environment variables for configuration and supports different game modes and battle strategies. Users can specify teams and choose between algorithms like Monte-Carlo Tree Search and Expectiminimax. Foul Play can be run locally or with Docker, and the engine used for battles must be built from source. The tool provides flexibility in gameplay and strategy for Pokémon battles.

github

: 272

TPI-LLM

TPI-LLM (Tensor Parallelism Inference for Large Language Models) is a system designed to bring LLM functions to low-resource edge devices, addressing privacy concerns by enabling LLM inference on edge devices with limited resources. It leverages multiple edge devices for inference through tensor parallelism and a sliding window memory scheduler to minimize memory usage. TPI-LLM demonstrates significant improvements in TTFT and token latency compared to other models, and plans to support infinitely large models with low token latency in the future.

github

: 123

cortex.cpp

Cortex.cpp is an open-source platform designed as the brain for robots, offering functionalities such as vision, speech, language, tabular data processing, and action. It provides an AI platform for running AI models with multi-engine support, hardware optimization with automatic GPU detection, and an OpenAI-compatible API. Users can download models from the Hugging Face model hub, run models, manage resources, and access advanced features like multiple quantizations and engine management. The tool is under active development, promising rapid improvements for users.

github

: 2.6k

local-deep-research

Local Deep Research is a powerful AI-powered research assistant that performs deep, iterative analysis using multiple LLMs and web searches. It can be run locally for privacy or configured to use cloud-based LLMs for enhanced capabilities. The tool offers advanced research capabilities, flexible LLM support, rich output options, privacy-focused operation, enhanced search integration, and academic & scientific integration. It also provides a web interface, command line interface, and supports multiple LLM providers and search engines. Users can configure AI models, search engines, and research parameters for customized research experiences.

github

: 2.0k

distributed-llama

Distributed Llama is a tool that allows you to run large language models (LLMs) on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. It uses TCP sockets to synchronize the state of the neural network, and you can easily configure your AI cluster by using a home router. Distributed Llama supports models such as Llama 2 (7B, 13B, 70B) chat and non-chat versions, Llama 3, and Grok-1 (314B).

github

: 2.0k

gollama

Gollama is a delightful tool that brings Ollama, your offline conversational AI companion, directly into your terminal. It provides a fun and interactive way to generate responses from various models without needing internet connectivity. Whether you're brainstorming ideas, exploring creative writing, or just looking for inspiration, Gollama is here to assist you. The tool offers an interactive interface, customizable prompts, multiple models selection, and visual feedback to enhance user experience. It can be installed via different methods like downloading the latest release, using Go, running with Docker, or building from source. Users can interact with Gollama through various options like specifying a custom base URL, prompt, model, and enabling raw output mode. The tool supports different modes like interactive, piped, CLI with image, and TUI with image. Gollama relies on third-party packages like bubbletea, glamour, huh, and lipgloss. The roadmap includes implementing piped mode, support for extracting codeblocks, copying responses/codeblocks to clipboard, GitHub Actions for automated releases, and downloading models directly from Ollama using the rest API. Contributions are welcome, and the project is licensed under the MIT License.

github

: 80

evalchemy

Evalchemy is a unified and easy-to-use toolkit for evaluating language models, focusing on post-trained models. It integrates multiple existing benchmarks such as RepoBench, AlpacaEval, and ZeroEval. Key features include unified installation, parallel evaluation, simplified usage, and results management. Users can run various benchmarks with a consistent command-line interface and track results locally or integrate with a database for systematic tracking and leaderboard submission.

github

: 317

graphrag-visualizer

GraphRAG Visualizer is an application designed to visualize Microsoft GraphRAG artifacts by uploading parquet files generated from the GraphRAG indexing pipeline. Users can view and analyze data in 2D or 3D graphs, display data tables, search for specific nodes or relationships, and process artifacts locally for data security and privacy.

github

: 301

ovos-installer

The ovos-installer is a simple and multilingual tool designed to install Open Voice OS and HiveMind using Bash, Whiptail, and Ansible. It supports various Linux distributions and provides an automated installation process. Users can easily start and stop services, update their Open Voice OS instance, and uninstall the tool if needed. The installer also allows for non-interactive installation through scenario files. It offers a user-friendly way to set up Open Voice OS on different systems.

github

: 138

rpaframework

RPA Framework is an open-source collection of libraries and tools for Robotic Process Automation (RPA), designed to be used with Robot Framework and Python. It offers well-documented core libraries for Software Robot Developers, optimized for Robocorp Control Room and Developer Tools, and accepts external contributions. The project includes various libraries for tasks like archiving, browser automation, date/time manipulations, cloud services integration, encryption operations, database interactions, desktop automation, document processing, email operations, Excel manipulation, file system operations, FTP interactions, web API interactions, image manipulation, AI services, and more. The development of the repository is Python-based and requires Python version 3.8+, with tooling based on poetry and invoke for compiling, building, and running the package. The project is licensed under the Apache License 2.0.

github

: 1.1k

StableToolBench

StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features such as a Virtual API System with caching and API simulators, a new set of solvable queries determined by LLMs, and a Stable Evaluation System using GPT-4. The Virtual API Server can be set up either by building from source or using a prebuilt Docker image. Users can test the server using provided scripts and evaluate models with Solvable Pass Rate and Solvable Win Rate metrics. The tool also includes model experiments results comparing different models' performance.

github

: 59

Large-Language-Models-play-StarCraftII

Large Language Models Play StarCraft II is a project that explores the capabilities of large language models (LLMs) in playing the game StarCraft II. The project introduces TextStarCraft II, a textual environment for the game, and a Chain of Summarization method for analyzing game information and making strategic decisions. Through experiments, the project demonstrates that LLM agents can defeat the built-in AI at a challenging difficulty level. The project provides benchmarks and a summarization approach to enhance strategic planning and interpretability in StarCraft II gameplay.

github

: 152

paperless-gpt

paperless-gpt is a tool designed to generate accurate and meaningful document titles and tags for paperless-ngx using Large Language Models (LLMs). It supports multiple LLM providers, including OpenAI and Ollama. With paperless-gpt, you can streamline your document management by automatically suggesting appropriate titles and tags based on the content of your scanned documents. The tool offers features like multiple LLM support, customizable prompts, easy integration with paperless-ngx, user-friendly interface for reviewing and applying suggestions, dockerized deployment, automatic document processing, and an experimental OCR feature.

github

: 724

For similar tasks

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

github

: 159

onnxruntime-genai

ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.

github

: 442

jupyter-ai

Jupyter AI connects generative AI with Jupyter notebooks. It provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook. Specifically, Jupyter AI offers: * An `%%ai` magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, Kaggle, VSCode, etc.). * A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant. * Support for a wide range of generative model providers, including AI21, Anthropic, AWS, Cohere, Gemini, Hugging Face, NVIDIA, and OpenAI. * Local model support through GPT4All, enabling use of generative AI models on consumer grade machines with ease and privacy.

github

: 3.5k

khoj

Khoj is an open-source, personal AI assistant that extends your capabilities by creating always-available AI agents. You can share your notes and documents to extend your digital brain, and your AI agents have access to the internet, allowing you to incorporate real-time information. Khoj is accessible on Desktop, Emacs, Obsidian, Web, and Whatsapp, and you can share PDF, markdown, org-mode, notion files, and GitHub repositories. You'll get fast, accurate semantic search on top of your docs, and your agents can create deeply personal images and understand your speech. Khoj is self-hostable and always will be.

github

: 28.5k

langchain_dart

LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).

github

: 497

danswer

Danswer is an open-source Gen-AI Chat and Unified Search tool that connects to your company's docs, apps, and people. It provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. Since you own the deployment, your user data and chats are fully in your own control. Danswer is MIT licensed and designed to be modular and easily extensible. The system also comes fully ready for production usage with user authentication, role management (admin/basic users), chat persistence, and a UI for configuring Personas (AI Assistants) and their Prompts. Danswer also serves as a Unified Search across all common workplace tools such as Slack, Google Drive, Confluence, etc. By combining LLMs and team specific knowledge, Danswer becomes a subject matter expert for the team. Imagine ChatGPT if it had access to your team's unique knowledge! It enables questions such as "A customer wants feature X, is this already supported?" or "Where's the pull request for feature Y?"

github

: 10.5k

infinity

Infinity is an AI-native database designed for LLM applications, providing incredibly fast full-text and vector search capabilities. It supports a wide range of data types, including vectors, full-text, and structured data, and offers a fused search feature that combines multiple embeddings and full text. Infinity is easy to use, with an intuitive Python API and a single-binary architecture that simplifies deployment. It achieves high performance, with 0.1 milliseconds query latency on million-scale vector datasets and up to 15K QPS.

github

: 3.3k

For similar jobs

h2ogpt

h2oGPT is an Apache V2 open-source project that allows users to query and summarize documents or chat with local private GPT LLMs. It features a private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.), a persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.), and efficient use of context using instruct-tuned LLMs (no need for LangChain's few-shot approach). h2oGPT also offers parallel summarization and extraction, reaching an output of 80 tokens per second with the 13B LLaMa2 model, HYDE (Hypothetical Document Embeddings) for enhanced retrieval based upon LLM responses, a variety of models supported (LLaMa2, Mistral, Falcon, Vicuna, WizardLM. With AutoGPTQ, 4-bit/8-bit, LORA, etc.), GPU support from HF and LLaMa.cpp GGML models, and CPU support using HF, LLaMa.cpp, and GPT4ALL models. Additionally, h2oGPT provides Attention Sinks for arbitrarily long generation (LLaMa-2, Mistral, MPT, Pythia, Falcon, etc.), a UI or CLI with streaming of all models, the ability to upload and view documents through the UI (control multiple collaborative or personal collections), Vision Models LLaVa, Claude-3, Gemini-Pro-Vision, GPT-4-Vision, Image Generation Stable Diffusion (sdxl-turbo, sdxl) and PlaygroundAI (playv2), Voice STT using Whisper with streaming audio conversion, Voice TTS using MIT-Licensed Microsoft Speech T5 with multiple voices and Streaming audio conversion, Voice TTS using MPL2-Licensed TTS including Voice Cloning and Streaming audio conversion, AI Assistant Voice Control Mode for hands-free control of h2oGPT chat, Bake-off UI mode against many models at the same time, Easy Download of model artifacts and control over models like LLaMa.cpp through the UI, Authentication in the UI by user/password via Native or Google OAuth, State Preservation in the UI by user/password, Linux, Docker, macOS, and Windows support, Easy Windows Installer for Windows 10 64-bit (CPU/CUDA), Easy macOS Installer for macOS (CPU/M1/M2), Inference Servers support (oLLaMa, HF TGI server, vLLM, Gradio, ExLLaMa, Replicate, OpenAI, Azure OpenAI, Anthropic), OpenAI-compliant, Server Proxy API (h2oGPT acts as drop-in-replacement to OpenAI server), Python client API (to talk to Gradio server), JSON Mode with any model via code block extraction. Also supports MistralAI JSON mode, Claude-3 via function calling with strict Schema, OpenAI via JSON mode, and vLLM via guided_json with strict Schema, Web-Search integration with Chat and Document Q/A, Agents for Search, Document Q/A, Python Code, CSV frames (Experimental, best with OpenAI currently), Evaluate performance using reward models, and Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours.

github

: 11.7k

mistral.rs

github

: 5.4k

ollama

Ollama is a lightweight, extensible framework for building and running language models on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications. Ollama is designed to be easy to use and accessible to developers of all levels. It is open source and available for free on GitHub.

github

: 135.5k

llama-cpp-agent

The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM models, execute structured function calls and get structured output (objects). It provides a simple yet robust interface and supports llama-cpp-python and OpenAI endpoints with GBNF grammar support (like the llama-cpp-python server) and the llama.cpp backend server. It works by generating a formal GGML-BNF grammar of the user defined structures and functions, which is then used by llama.cpp to generate text valid to that grammar. In contrast to most GBNF grammar generators it also supports nested objects, dictionaries, enums and lists of them.

github

: 454

llama_ros

This repository provides a set of ROS 2 packages to integrate llama.cpp into ROS 2. By using the llama_ros packages, you can easily incorporate the powerful optimization capabilities of llama.cpp into your ROS 2 projects by running GGUF-based LLMs and VLMs.

github

: 195

MITSUHA

OneReality is a virtual waifu/assistant that you can speak to through your mic and it'll speak back to you! It has many features such as: * You can speak to her with a mic * It can speak back to you * Has short-term memory and long-term memory * Can open apps * Smarter than you * Fluent in English, Japanese, Korean, and Chinese * Can control your smart home like Alexa if you set up Tuya (more info in Prerequisites) It is built with Python, Llama-cpp-python, Whisper, SpeechRecognition, PocketSphinx, VITS-fast-fine-tuning, VITS-simple-api, HyperDB, Sentence Transformers, and Tuya Cloud IoT.

github

: 174

wenxin-starter

WenXin-Starter is a spring-boot-starter for Baidu's "Wenxin Qianfan WENXINWORKSHOP" large model, which can help you quickly access Baidu's AI capabilities. It fully integrates the official API documentation of Wenxin Qianfan. Supports text-to-image generation, built-in dialogue memory, and supports streaming return of dialogue. Supports QPS control of a single model and supports queuing mechanism. Plugins will be added soon.

github

: 207

FlexFlow

FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.

github

: 1.7k

mistral.rs

README:

mistral.rs

Blazingly fast LLM inference.

Get started fast 🚀

Quick examples

Credit

Description

Architecture Support matrix

APIs and Integrations

Rust Crate

Python API

HTTP Server

Llama Index integration (Python)

Supported accelerators

Installation and Build

Getting models

Getting models from Hugging Face Hub

Loading models from local files:

Running GGUF models

With a specified tokenizer

With the builtin tokenizer

Run with the CLI

Architecture for plain models

Architecture for vision models

Supported GGUF architectures

Interactive mode

OpenAI HTTP server

Structured selection with a .toml file

Benchmarks

Supported models

Using derivative model

Adapter model support: X-LoRA and LoRA

Chat Templates and Tokenizer

Contributing

FAQ

Credits

For Tasks:

For Jobs:

Alternative AI tools for mistral.rs

Similar Open Source Tools

mistral.rs

rwkv.cpp

aicommit2

foul-play

TPI-LLM

cortex.cpp

local-deep-research

distributed-llama

gollama

evalchemy

graphrag-visualizer

ovos-installer

rpaframework

StableToolBench

Large-Language-Models-play-StarCraftII

paperless-gpt

For similar tasks

LLMStack

ai-guide

onnxruntime-genai

jupyter-ai

khoj

langchain_dart

danswer

infinity

For similar jobs

h2ogpt

mistral.rs

ollama

llama-cpp-agent

llama_ros

MITSUHA

wenxin-starter

FlexFlow

Structured selection with a `.toml` file