mistral.rs
Blazingly fast LLM inference.
Mistral.rs is a fast LLM inference platform written in Rust. We support inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-API-compatible HTTP server and Python bindings.
| Rust Documentation | Python Documentation | Discord | Matrix |
Please submit requests for new models here.
- Deploy with our easy-to-use APIs after following the installation instructions.
- Check out UQFF for prequantized models of various methods! Models can be found here.
- 🦙📷 Run the Llama 3.2 Vision model: documentation and guide here.
  ./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
- 🔥🧠 AnyMoE: build a memory-efficient MoE model from anything, in seconds.
  ./mistralrs-server -i toml -f toml-selectors/anymoe_lora.toml
- φ³ Run the Phi 3.5/3.1/3 models with a 128K context window.
  ./mistralrs-server -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3
- 🌀 Run the Phi 3.5 MoE model with a 128K context window: documentation and guide here.
  ./mistralrs-server -i plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe
- φ³ 📷 Run the Phi 3 vision model: documentation and guide here.
  ./mistralrs-server --port 1234 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v
- 🌲📷 Run the FLUX.1 diffusion model: documentation and guide here.
  ./mistralrs-server --port 1234 diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
- Other models: see the support matrix and how to run them.
Mistral.rs supports several model categories:
- Text to Text
- Text+Image to Text: Vision (see the docs)
- Text to Image: Image Generation (see the docs)
Easy:
- Lightweight OpenAI API compatible HTTP server
- Python API
- Grammar support with Regex and Yacc
- ISQ (in-situ quantization): run `.safetensors` models directly from 🤗 Hugging Face by quantizing them in-place
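For example, ISQ can be requested at load time with the `--isq` flag (a sketch based on the ISQ documentation; the flag placement and the `Q4K` level are assumptions to verify with `./mistralrs-server --help`):
./mistralrs-server -i --isq Q4K plain -m microsoft/Phi-3.5-mini-instruct -a phi3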
Fast:
- Apple silicon support: ARM NEON, Accelerate, Metal
- Accelerated CPU inference with MKL, AVX support
- CUDA support with flash attention and cuDNN.
- Device mapping: load and run some layers on the device and the rest on the CPU.
Quantization:
- Details
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with ISQ support.
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit, with Marlin kernel support in 4-bit and 8-bit.
- HQQ: 4-bit and 8-bit, with ISQ support
Powerful:
- LoRA support with weight merging
- First X-LoRA inference platform with first-class support
- AnyMoE: Build a memory-efficient MoE model from anything, in seconds
- Various sampling and penalty methods
- Tool calling: docs
- Prompt chunking: process large prompts in a more manageable way
Advanced features:
- PagedAttention and continuous batching
- Prefix caching
- Topology: Configure ISQ and device mapping easily
- UQFF: Quantized file format for easy mixing of quants, collection here.
- Speculative Decoding: Mix supported models as the draft model or the target model
- Dynamic LoRA adapter activation with adapter preloading: examples and docs
Documentation for mistral.rs can be found here.
This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized via ISQ to Q4K.
https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
Note: See supported models for more information
Model | Supports quantization | Supports adapters | Supports device mapping | Supported by AnyMoE |
---|---|---|---|---|
Mistral v0.1/v0.2/v0.3 | ✅ | ✅ | ✅ | ✅ |
Gemma | ✅ | ✅ | ✅ | ✅ |
Llama 3.1/3.2 | ✅ | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | ✅ | |
Phi 2 | ✅ | ✅ | ✅ | ✅ |
Phi 3 | ✅ | ✅ | ✅ | ✅ |
Phi 3.5 MoE | ✅ | | ✅ | |
Qwen 2.5 | ✅ | | ✅ | ✅ |
Phi 3 Vision | ✅ | | ✅ | ✅ |
Idefics 2 | ✅ | | ✅ | ✅ |
Gemma 2 | ✅ | ✅ | ✅ | ✅ |
Starcoder 2 | ✅ | ✅ | ✅ | ✅ |
LLaVa Next | ✅ | | ✅ | ✅ |
LLaVa | ✅ | | ✅ | ✅ |
Llama 3.2 Vision | ✅ | | ✅ | |
Rust multithreaded/async API for easy integration into any application.
- Docs
- Examples
- To install: add the following to your Cargo.toml: `mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }`
Python API for mistral.rs.
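Below is a minimal chat-completion sketch using the `mistralrs` Python package. The names (`Runner`, `Which.Plain`, `Architecture`, `ChatCompletionRequest`) follow the project's Python examples, but treat them as assumptions and check the Python documentation for your installed version:

```python
from mistralrs import Architecture, ChatCompletionRequest, Runner, Which

# Load a plain (unquantized) model from the Hugging Face Hub.
runner = Runner(
    which=Which.Plain(
        model_id="microsoft/Phi-3.5-mini-instruct",
        arch=Architecture.Phi3,
    ),
)

# Send an OpenAI-style chat completion request.
res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3",  # illustrative model tag
        messages=[{"role": "user", "content": "Explain ISQ in one paragraph."}],
        max_tokens=256,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```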
OpenAI API compatible API server
- CUDA:
  - Compile with the `cuda` feature: `--features cuda`
  - FlashAttention support: compile with the `flash-attn` feature
  - cuDNN support: compile with the `cudnn` feature: `--features cudnn`
- Metal:
  - Compile with the `metal` feature: `--features metal`
- CPU:
  - Intel MKL: compile with the `mkl` feature: `--features mkl`
  - Apple Accelerate: compile with the `accelerate` feature: `--features accelerate`
  - ARM NEON and AVX are used automatically
Enabling features is done by passing `--features ...` to the build system. When using `cargo run` or `maturin develop`, pass the `--features` flag before the `--` separating build flags from runtime flags.
- To enable a single feature like `metal`: `cargo build --release --features metal`.
- To enable multiple features, specify them in quotes: `cargo build --release --features "cuda flash-attn cudnn"`.
Note: You can use our Docker containers here. Learn more about running Docker containers: https://docs.docker.com/engine/reference/run/
Note: You can use pre-built `mistralrs-server` binaries here.
- Install the Python package here.
- Install required packages:
  - OpenSSL (example on Ubuntu: `sudo apt install libssl-dev`)
  - Linux only: `pkg-config` (example on Ubuntu: `sudo apt install pkg-config`)
- Install Rust: https://rustup.rs/
  Example on Ubuntu:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env
- Optional: set your HF token correctly (skip if it is already set, if your model is not gated, or if you want to use the `token_source` parameters in Python or on the command line):
  - Note: you can install `huggingface-cli` as documented here.
  huggingface-cli login
- Download the code:
  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs
- Build or install:
  - Base build command: cargo build --release
  - Build with CUDA support: cargo build --release --features cuda
  - Build with CUDA and Flash Attention V2 support: cargo build --release --features "cuda flash-attn"
  - Build with Metal support: cargo build --release --features metal
  - Build with Accelerate support: cargo build --release --features accelerate
  - Build with MKL support: cargo build --release --features mkl
  - Install with `cargo install` for easy command-line usage; pass the same values to `--features` as you would for `cargo build`: cargo install --path mistralrs-server --features cuda
- The build process will output a binary `mistralrs-server` at `./target/release/mistralrs-server`, which may be copied into the working directory with the following command (example on Ubuntu):
  cp ./target/release/mistralrs-server ./mistralrs-server
- Use our APIs and integrations.
There are two ways to get models with mistral.rs:
- From the Hugging Face Hub (easiest)
- From local files:
  - Running a GGUF model
  - Specifying local paths
Mistral.rs can automatically download models from the Hugging Face Hub. To access gated models, you should provide a token source, which may be one of:
- `literal:<value>`: load from a specified literal
- `env:<value>`: load from a specified environment variable
- `path:<value>`: load from a specified file
- `cache` (default): load from the HF token at ~/.cache/huggingface/token or equivalent
- `none`: use no HF token
This is passed in the following ways:
- Command line:
./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
- Python:
Here is an example of setting the token source (see the sketch below). If the token cannot be loaded, no token will be used (i.e. effectively using `none`).
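A minimal sketch, reusing the `Runner` API from the Python example above (the `token_source` keyword mirrors the parameter named in this README; verify the exact signature against the Python documentation):

```python
from mistralrs import Architecture, Runner, Which

# token_source accepts the same source strings as the command line:
# "literal:<value>", "env:<value>", "path:<value>", "cache", or "none".
runner = Runner(
    which=Which.Plain(
        model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",  # a gated model
        arch=Architecture.Llama,
    ),
    token_source="env:HF_TOKEN",  # read the HF token from $HF_TOKEN
)
```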
You can also instruct mistral.rs to load models fully locally by modifying the `*_model_id` arguments or options:
./mistralrs-server --port 1234 plain -m . -a mistral
Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:
- `--model-id` (server) or `model_id` (python/rust) or `--tok-model-id` (server) or `tok_model_id` (python/rust):
  - `config.json`
  - `tokenizer_config.json`
  - `tokenizer.json` (if not specified separately)
  - `.safetensors`/`.bin`/`.pth`/`.pt` files (defaults to `.safetensors`)
  - `preprocessor_config.json` (required for vision models)
  - `processor_config.json` (optional for vision models)
- `--quantized-model-id` (server) or `quantized_model_id` (python/rust):
  - The specified `.gguf` or `.ggml` file
- `--x-lora-model-id` (server) or `xlora_model_id` (python/rust):
  - `xlora_classifier.safetensors`
  - `xlora_config.json`
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories
- `--adapters-model-id` (server) or `adapters_model_id` (python/rust):
  - Adapter `.safetensors` and `adapter_config.json` files in their respective directories
To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options:
Running with a tokenizer model ID enables you to specify the model ID to source the tokenizer from:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf -t microsoft/Phi-3.5-mini-instruct
If the specified tokenizer model ID contains a `tokenizer.json`, it will be used over the GGUF tokenizer.
Using the built-in tokenizer:
./mistralrs-server gguf -m bartowski/Phi-3.5-mini-instruct-GGUF -f Phi-3.5-mini-instruct-Q4_K_M.gguf
(or using a local file):
./mistralrs-server gguf -m path/to/files -f Phi-3.5-mini-instruct-Q4_K_M.gguf
There are a few more ways to configure:
Chat template:
If no other chat template source (including the tokenizer model ID) is specified, the chat template can be automatically detected and loaded from the GGUF file.
If that does not work, you can either provide a tokenizer (recommended), or specify a custom chat template.
./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3.5-mini-instruct-Q4_K_M.gguf
Tokenizer
The following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in the examples below, where the tokenizer is sourced from Hugging Face.
Supported GGUF tokenizer types
- `llama` (SentencePiece)
- `gpt2` (BPE)
Mistral.rs uses subcommands to control the model type. They are generally of the format `<XLORA/LORA>-<QUANTIZATION>` (for example, `plain`, `gguf`, or `x-lora-plain`). Please run `./mistralrs-server --help` to see the subcommands.
Note: for plain models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16`, or `auto` to choose based on the device. It is specified in the `--dtype`/`-d` parameter after the model architecture (`plain`).
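For example (a sketch; the `-d` placement after the `plain` subcommand follows the note above):
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3 -d bf16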
If you do not specify the architecture, an attempt will be made to use the model's config. If this fails, please raise an issue.
- `mistral`
- `gemma`
- `mixtral`
- `llama`
- `phi2`
- `phi3`
- `phi3.5moe`
- `qwen2`
- `gemma2`
- `starcoder2`
Note: for vision models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16`, or `auto` to choose based on the device. It is specified in the `--dtype`/`-d` parameter after the model architecture (`vision-plain`).
- `phi3v`
- `idefics2`
- `llava_next`
- `llava`
- `vllama`
Supported GGUF architectures:
Plain:
- `llama`
- `phi2`
- `phi3`
- `starcoder2`
- `qwen2`
With adapters:
- `llama`
- `phi3`
You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Vision models work too:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
And even diffusion models:
./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
You can launch an HTTP server:
./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe
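Once the server is running, any OpenAI-compatible client can query it. A minimal sketch with the official `openai` Python package (the `model` value is illustrative, and the port matches the command above):

```python
from openai import OpenAI

# mistralrs-server exposes an OpenAI-compatible API; no real key is needed.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="phi3.5moe",  # illustrative; use the model you launched
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```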
We provide a method to select models with a `.toml` file. The keys are the same as the command line, with `no_kv_cache` and `tokenizer_json` being "global" keys.
Example:
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
Device | Mistral.rs Completion T/s | Llama.cpp Completion T/s | Model | Quant |
---|---|---|---|---|
A10 GPU, CUDA | 86 | 83 | mistral-7b | 4_K_M |
Intel Xeon 8358 CPU, AVX | 11 | 23 | mistral-7b | 4_K_M |
Raspberry Pi 5 (8GB), Neon | 2 | 3 | mistral-7b | 2_K |
A100 GPU, CUDA | 131 | 134 | mistral-7b | 4_K_M |
RTX 6000 GPU, CUDA | 103 | 96 | mistral-7b | 4_K_M |
Note: all CUDA tests for mistral.rs were conducted with PagedAttention enabled, block size = 32.
Please submit more benchmarks by raising an issue!
Quantization support
Model | GGUF | GGML | ISQ |
---|---|---|---|
Mistral | ✅ | | ✅ |
Gemma | | | ✅ |
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | | ✅ |
Phi 2 | ✅ | | ✅ |
Phi 3 | ✅ | | ✅ |
Phi 3.5 MoE | | | ✅ |
Qwen 2.5 | | | ✅ |
Phi 3 Vision | | | ✅ |
Idefics 2 | | | ✅ |
Gemma 2 | | | ✅ |
Starcoder 2 | ✅ | | ✅ |
LLaVa Next | | | ✅ |
LLaVa | | | ✅ |
Llama 3.2 Vision | | | ✅ |
Device mapping support
Model category | Supported |
---|---|
Plain | ✅ |
GGUF | ✅ |
GGML | |
Vision Plain | ✅ |
X-LoRA and LoRA support
Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
---|---|---|---|
Mistral | ✅ | ✅ | |
Gemma | ✅ | ||
Llama | ✅ | ✅ | ✅ |
Mixtral | ✅ | ✅ | |
Phi 2 | ✅ | ||
Phi 3 | ✅ | ✅ | |
Phi 3.5 MoE | |||
Qwen 2.5 | |||
Phi 3 Vision | |||
Idefics 2 | |||
Gemma 2 | ✅ | ||
Starcoder 2 | ✅ | ||
LLaVa Next | |||
LLaVa | |||
Llama 3.2 Vision |
AnyMoE support
Model | AnyMoE |
---|---|
Mistral 7B | ✅ |
Gemma | ✅ |
Llama | ✅ |
Mixtral | |
Phi 2 | ✅ |
Phi 3 | ✅ |
Phi 3.5 MoE | |
Qwen 2.5 | ✅ |
Phi 3 Vision | |
Idefics 2 | |
Gemma 2 | ✅ |
Starcoder 2 | ✅ |
LLaVa Next | ✅ |
LLaVa | ✅ |
Llama 3.2 Vision |
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. When using a model other than the default, provide the following, depending on the model type:
- Plain: Model id
- Quantized: Quantized model id, quantized filename, and tokenizer id
- X-LoRA: Model id, X-LoRA ordering
- X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- LoRA: Model id, LoRA ordering
- LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
- Vision Plain: Model id
See this section to determine whether it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules, the architecture, or the adapter order changed.
It is also important to check the chat template style of the model. If the HF hub repo has a `tokenizer_config.json` file, it is not necessary to specify the chat template. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models here. Examples may be found here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request. If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
- Debugging with the environment variable `MISTRALRS_DEBUG=1` causes the following:
  - If loading a GGUF or GGML model, a file containing the names, shapes, and types of each tensor will be written to `mistralrs_gguf_tensors.txt` or `mistralrs_ggml_tensors.txt`.
  - More logging.
- Setting the CUDA compiler path:
  - Set the `NVCC_CCBIN` environment variable during build.
- Error: `recompile with -fPIE`:
  - Some Linux distributions require compiling with `-fPIE`.
  - Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during build: `CUDA_NVCC_FLAGS=-fPIE`.
- Error `CUDA_ERROR_NOT_FOUND` or symbol not found when using a normal or vision model:
  - For non-quantized models, you can specify the data type to load and run in. This must be one of `f32`, `f16`, `bf16`, or `auto` to choose based on the device.
This project would not be possible without the excellent work at `candle`. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding new functionality.