paddler
Stateful load balancer custom-tailored for llama.cpp
Paddler is an open-source load balancer and reverse proxy designed specifically for optimizing servers running llama.cpp. It overcomes typical load balancing challenges by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution. Paddler also supports dynamic addition or removal of servers, enabling integration with autoscaling tools.
README:
Paddler is an open-source, production-ready, stateful load balancer and reverse proxy designed to optimize servers running llama.cpp.
Typical load balancing strategies like round robin and least connections are ineffective for llama.cpp servers, which use continuous batching algorithms and allow slots to be configured to handle multiple requests concurrently.
Paddler is designed to support llama.cpp-specific features like slots. It works by maintaining a stateful load balancer aware of each server's available slots, ensuring efficient request distribution.
[!NOTE] In simple terms, the `slots` in llama.cpp refer to predefined memory slices within the server that handle individual requests. When a request comes in, it is assigned to an available slot for processing. They are predictable and highly configurable. You can learn more about them in the llama.cpp server documentation.
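For instance, the number of slots corresponds to the server's parallel sequence count. A minimal sketch, assuming a recent llama.cpp build where the server binary is named `llama-server` (older builds call it `server`) and with the model path as a placeholder:

```shell
# Start a llama.cpp server with 4 slots (4 parallel sequences).
# Binary name, flags, and model path vary by llama.cpp version;
# this is illustrative, not taken from the Paddler docs.
./llama-server -m ./model.gguf --host 127.0.0.1 --port 8088 --parallel 4
```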
- Uses agents to monitor the slots of individual llama.cpp instances.
- Supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
- Buffers requests, allowing you to scale from zero hosts.
- Integrates with the StatsD protocol and also comes with a built-in dashboard.
- AWS integration.
Paddler is aware of each server's available slots, ensuring efficient request ("R") distribution.
llama.cpp instances need to be registered in Paddler. Paddler’s agents should be installed alongside llama.cpp instances so that they can report their slots status to the load balancer.
The sequence repeats for each agent:
```mermaid
sequenceDiagram
    participant loadbalancer as Paddler Load Balancer
    participant agent as Paddler Agent
    participant llamacpp as llama.cpp

    agent->>llamacpp: Hey, are you alive?
    llamacpp-->>agent: Yes, this is my slots status
    agent-->>loadbalancer: llama.cpp is still working
    loadbalancer->>llamacpp: I have a request for you to handle
```
Download the latest release for Linux, Mac, or Windows from the releases page.

On Linux, if you want Paddler to be accessible system-wide, rename the downloaded executable to `/usr/bin/paddler` (or `/usr/local/bin/paddler`).
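A Linux install might look like the following sketch (the release tag and asset name are placeholders; check the releases page for the real ones):

```shell
# Placeholders: substitute the actual version tag and asset name
# from the releases page.
curl -L -o paddler \
    https://github.com/distantmagic/paddler/releases/download/vX.Y.Z/paddler-bin-linux-amd64
chmod +x paddler
sudo mv paddler /usr/local/bin/paddler
```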
The next step is to run Paddler’s agents. Agents register your llama.cpp instances in Paddler and monitor their slots. They should be installed on the same host as the server that runs llama.cpp.
An agent needs a few pieces of information:

- `external-*` tells how the load balancer can connect to the llama.cpp instance
- `local-*` tells how the agent can connect to the llama.cpp instance
- `management-*` tells where the agent should report the slots status
Run the following to start a Paddler agent (replace the hosts and ports with your own server addresses when deploying):
```shell
./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085
```
[!NOTE] Available since v0.6.0

With the `--name` flag, you can assign each agent a custom name. This name is displayed in the management dashboard and is not used for any other purpose.
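For example (the name value here is arbitrary):

```shell
# Same agent invocation as above, with a human-readable name added.
./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --name agent-gpu-01
```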
The load balancer collects data from agents and exposes a reverse proxy to the outside world.
It requires two sets of flags:

- `management-*` tells where the load balancer should listen for updates from agents
- `reverseproxy-*` tells how the load balancer can be reached from outside hosts
To start the load balancer, run:
```shell
./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080
```
`management-host` and `management-port` in agents should be the same as in the load balancer.
You can enable the dashboard to see the status of the agents with the `--management-dashboard-enable=true` flag. If enabled, it is available at the management server address under the `/dashboard` path.
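Continuing the balancer example above:

```shell
# Same balancer invocation as before, with the dashboard enabled.
./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080 \
    --management-dashboard-enable=true
```

With this configuration, the dashboard is served at http://127.0.0.1:8085/dashboard.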
[!NOTE] Available since v0.8.0

In some cases (see: #20), you might want to rewrite the `Host` header. In such cases, you can use the `--rewrite-host-header` flag. If used, Paddler will use the external host provided by agents instead of the balancer host when forwarding the requests.
The Paddler balancer's `/health` endpoint aggregates the `/health` endpoints of the llama.cpp instances and reports the total number of available and processing slots.
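A quick way to check it is a plain HTTP request. This sketch assumes the endpoint is served from the management address used in the examples above; adjust the address to match your deployment:

```shell
# The address here is an assumption; point this at wherever your
# balancer serves its /health endpoint.
curl http://127.0.0.1:8085/health
```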
[!NOTE] Available since v0.3.0
The load balancer's buffered requests allow your infrastructure to scale from zero hosts by providing an additional metric (the number of requests waiting to be handled).

Buffering also gives your infrastructure some extra time to add hosts. For example, if your autoscaler is setting up an additional server, putting an incoming request on hold for 60 seconds gives it a chance to be handled even though there may be no available llama.cpp instances at the moment it is issued.

Scaling from zero hosts is especially suitable for low-traffic projects because it lets you cut infrastructure costs: you won't be paying your cloud provider anything while your service is idle.
https://github.com/distantmagic/paddler/assets/1286785/34b93e4c-0746-4eed-8be3-cd698e15cbf9
Although Paddler integrates with the StatsD protocol, you can preview the cluster's state using a built-in dashboard.
[!NOTE] Available since v0.3.0
[!TIP] If you keep your stack self-hosted, you can use Prometheus with the StatsD exporter to handle the incoming metrics.
[!TIP] This feature works with AWS CloudWatch Agent as well.
Paddler supports the following StatsD metrics:

- `paddler.requests_buffered`: the number of buffered requests since the last report (resets after each report)
- `paddler.slots_idle`: total idle slots
- `paddler.slots_processing`: total slots processing requests

All of them use `gauge` internally.
StatsD metrics need to be enabled with the following flags:

```shell
./paddler balancer \
    # .. put all the other flags here ...
    --statsd-enable=true \
    --statsd-host=127.0.0.1 \
    --statsd-port=8125 \
    --statsd-scheme=http
```
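If you follow the self-hosted tip above, a minimal sketch with Prometheus' statsd_exporter could look like this (the exporter's flags are real, but the transport and ports are assumptions matched to the example above):

```shell
# Receive StatsD metrics on :8125 (where the example Paddler config
# sends them) and expose them to Prometheus on :9102/metrics.
./statsd_exporter \
    --statsd.listen-udp=:8125 \
    --statsd.listen-tcp=:8125 \
    --web.listen-address=:9102
```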
[!NOTE] Available since v0.3.0
When running on AWS EC2, you can replace `--local-llamacpp-host` with `aws:metadata:local-ipv4`. In that case, Paddler will use EC2 instance metadata to fetch the local IP address (from the local network).

If you want to keep the balancer management address predictable, I recommend using Route 53 to create a record that always points to your load balancer (for example, `paddler_balancer.example.com`), which results in something like this:
```shell
./paddler agent \
    --external-llamacpp-host aws:metadata:local-ipv4 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host paddler_balancer.example.com \
    --management-port 8085
```
- Add `--rewrite-host-header` flag to balancer to rewrite the `Host` header in forwarded requests (see: #20)
- Fixed: incorrect preemptive counting of remaining slots in some scenarios

Requires at least the b3606 llama.cpp release.
- Adjusted to handle breaking changes in the llama.cpp `/health` endpoint (https://github.com/ggerganov/llama.cpp/pull/9056). Instead of using the `/health` endpoint to monitor slot statuses, starting from this version Paddler uses the `/slots` endpoint to monitor llama.cpp instances. Paddler's own `/health` endpoint remains unchanged.

Latest supported llama.cpp release: b3604
- Fixed: the management server crashed in some scenarios due to concurrency issues
Thank you, @ScottMcNaught, for the help with debugging the issues! :)
- OpenAI-compatible endpoint (`/v1/chat/completions`) is now properly balanced
- Fixed: the balancer's reverse proxy panicked in some scenarios when the underlying llama.cpp instance was abruptly closed during the generation of completion tokens
- Added a mutex in the targets collection for better internal slots data integrity
- Requests can queue when all llama.cpp instances are busy
- AWS Metadata support for agent local IP address
- StatsD metrics support
I initially wanted to use the Raft consensus algorithm (thus Paddler, because it paddles on a Raft), but eventually I dropped that idea. The name stayed, though.
Later, people started sending me a "that's a paddlin'" clip from The Simpsons, and I just embraced it.
Discord: https://discord.gg/kysUzFqSCK