paddler
Stateful load balancer custom-tailored for llama.cpp
Paddler is an open-source load balancer and reverse proxy designed specifically for optimizing servers running llama.cpp. It overcomes typical load balancing challenges by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution. Paddler also supports dynamic addition or removal of servers, enabling integration with autoscaling tools.
README:
Paddler is an open-source, production-ready, stateful load balancer and reverse proxy designed to optimize servers running llama.cpp.
Typical load balancing strategies like round robin and least connections are ineffective for llama.cpp servers, which use continuous batching algorithms and allow slots to be configured to handle multiple requests concurrently.
Paddler is designed to support llama.cpp-specific features like slots. It works by maintaining a stateful load balancer aware of each server's available slots, ensuring efficient request distribution.
[!NOTE] In simple terms, the `slots` in llama.cpp refer to predefined memory slices within the server that handle individual requests. When a request comes in, it is assigned to an available slot for processing. They are predictable and highly configurable. You can learn more about them in the llama.cpp server documentation.
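For instance, the number of slots corresponds to the server's parallel sequence count. A minimal sketch, assuming a recent llama.cpp build where the server binary is named `llama-server` (older builds call it `server`) and with the model path as a placeholder:

```shell
# Start a llama.cpp server with 4 slots (4 parallel sequences).
# Binary name, flags, and model path vary by llama.cpp version;
# this is illustrative, not taken from the Paddler docs.
./llama-server -m ./model.gguf --host 127.0.0.1 --port 8088 --parallel 4
```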
- Uses agents to monitor the slots of individual llama.cpp instances.
- Supports the dynamic addition or removal of llama.cpp servers, enabling integration with autoscaling tools.
- Buffers requests, allowing you to scale from zero hosts.
- Integrates with the StatsD protocol and also comes with a built-in dashboard.
- AWS integration.
Paddler is aware of each server's available slots, ensuring efficient request ("R") distribution.
llama.cpp instances need to be registered in Paddler. Paddler’s agents should be installed alongside llama.cpp instances so that they can report their slots status to the load balancer.
The sequence repeats for each agent:
```mermaid
sequenceDiagram
    participant loadbalancer as Paddler Load Balancer
    participant agent as Paddler Agent
    participant llamacpp as llama.cpp

    agent->>llamacpp: Hey, are you alive?
    llamacpp-->>agent: Yes, this is my slots status
    agent-->>loadbalancer: llama.cpp is still working
    loadbalancer->>llamacpp: I have a request for you to handle
```
Download the latest release for Linux, Mac, or Windows from the releases page.

On Linux, if you want Paddler to be accessible system-wide, rename the downloaded executable to `/usr/bin/paddler` (or `/usr/local/bin/paddler`).
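A Linux install might look like the following sketch (the release tag and asset name are placeholders; check the releases page for the real ones):

```shell
# Placeholders: substitute the actual version tag and asset name
# from the releases page.
curl -L -o paddler \
    https://github.com/distantmagic/paddler/releases/download/vX.Y.Z/paddler-bin-linux-amd64
chmod +x paddler
sudo mv paddler /usr/local/bin/paddler
```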
The next step is to run Paddler’s agents. Agents register your llama.cpp instances in Paddler and monitor their slots. They should be installed on the same host as the server that runs llama.cpp.
An agent needs a few pieces of information:

- `external-*` tells how the load balancer can connect to the llama.cpp instance
- `local-*` tells how the agent can connect to the llama.cpp instance
- `management-*` tells where the agent should report the slots status
Run the following to start a Paddler agent (replace the hosts and ports with your own server addresses when deploying):
```shell
./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085
```
[!NOTE] Available since v0.6.0

With the `--name` flag, you can assign each agent a custom name. This name is displayed in the management dashboard and is not used for any other purpose.
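For example (the name value here is arbitrary):

```shell
# Same agent invocation as above, with a human-readable name added.
./paddler agent \
    --external-llamacpp-host 127.0.0.1 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --name agent-gpu-01
```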
The load balancer collects data from agents and exposes a reverse proxy to the outside world.
It requires two sets of flags:

- `management-*` tells where the load balancer should listen for updates from agents
- `reverseproxy-*` tells how the load balancer can be reached from outside hosts
To start the load balancer, run:
```shell
./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080
```
`management-host` and `management-port` in agents should be the same as in the load balancer.
You can enable the dashboard to see the status of the agents with the `--management-dashboard-enable=true` flag. If enabled, it is available at the management server address under the `/dashboard` path.
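Continuing the balancer example above:

```shell
# Same balancer invocation as before, with the dashboard enabled.
./paddler balancer \
    --management-host 127.0.0.1 \
    --management-port 8085 \
    --reverseproxy-host 192.168.2.10 \
    --reverseproxy-port 8080 \
    --management-dashboard-enable=true
```

With this configuration, the dashboard is served at http://127.0.0.1:8085/dashboard.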
[!NOTE] Available since v0.8.0

In some cases (see: #20), you might want to rewrite the `Host` header. In such cases, you can use the `--rewrite-host-header` flag. If used, Paddler will use the external host provided by agents instead of the balancer host when forwarding the requests.
The Paddler balancer's `/health` endpoint aggregates the `/health` endpoints of the llama.cpp instances and reports the total number of available and processing slots.
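A quick way to check it is a plain HTTP request. This sketch assumes the endpoint is served from the management address used in the examples above; adjust the address to match your deployment:

```shell
# The address here is an assumption; point this at wherever your
# balancer serves its /health endpoint.
curl http://127.0.0.1:8085/health
```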
[!NOTE] Available since v0.3.0
The load balancer's buffered requests allow your infrastructure to scale from zero hosts by providing an additional metric (the number of requests waiting to be handled).

Buffering also gives your infrastructure some extra time to add hosts. For example, if your autoscaler is setting up an additional server, putting an incoming request on hold for 60 seconds gives it a chance to be handled even though there may be no available llama.cpp instances at the moment it is issued.

Scaling from zero hosts is especially suitable for low-traffic projects because it lets you cut infrastructure costs: you won't be paying your cloud provider anything while your service is idle.
https://github.com/distantmagic/paddler/assets/1286785/34b93e4c-0746-4eed-8be3-cd698e15cbf9
Although Paddler integrates with the StatsD protocol, you can preview the cluster's state using a built-in dashboard.
[!NOTE] Available since v0.3.0
[!TIP] If you keep your stack self-hosted, you can use Prometheus with the StatsD exporter to handle the incoming metrics.
[!TIP] This feature works with AWS CloudWatch Agent as well.
Paddler supports the following StatsD metrics:

- `paddler.requests_buffered`: the number of buffered requests since the last report (resets after each report)
- `paddler.slots_idle`: total idle slots
- `paddler.slots_processing`: total slots processing requests

All of them use `gauge` internally.
StatsD metrics need to be enabled with the following flags:

```shell
./paddler balancer \
    # .. put all the other flags here ...
    --statsd-enable=true \
    --statsd-host=127.0.0.1 \
    --statsd-port=8125 \
    --statsd-scheme=http
```
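If you follow the self-hosted tip above, a minimal sketch with Prometheus' statsd_exporter could look like this (the exporter's flags are real, but the transport and ports are assumptions matched to the example above):

```shell
# Receive StatsD metrics on :8125 (where the example Paddler config
# sends them) and expose them to Prometheus on :9102/metrics.
./statsd_exporter \
    --statsd.listen-udp=:8125 \
    --statsd.listen-tcp=:8125 \
    --web.listen-address=:9102
```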
[!NOTE] Available since v0.3.0
When running on AWS EC2, you can replace `--local-llamacpp-host` with `aws:metadata:local-ipv4`. In that case, Paddler will use EC2 instance metadata to fetch the local IP address (from the local network).

If you want to keep the balancer management address predictable, I recommend using Route 53 to create a record that always points to your load balancer (for example, `paddler_balancer.example.com`), which results in something like this:
```shell
./paddler agent \
    --external-llamacpp-host aws:metadata:local-ipv4 \
    --external-llamacpp-port 8088 \
    --local-llamacpp-host 127.0.0.1 \
    --local-llamacpp-port 8088 \
    --management-host paddler_balancer.example.com \
    --management-port 8085
```
- Add `--rewrite-host-header` flag to balancer to rewrite the `Host` header in forwarded requests (see: #20)
- Fixed: incorrect preemptive counting of remaining slots in some scenarios

Requires at least the b3606 llama.cpp release.
- Adjusted to handle breaking changes in the llama.cpp `/health` endpoint (https://github.com/ggerganov/llama.cpp/pull/9056). Instead of using the `/health` endpoint to monitor slot statuses, starting from this version Paddler uses the `/slots` endpoint to monitor llama.cpp instances. Paddler's own `/health` endpoint remains unchanged.

Latest supported llama.cpp release: b3604
- Fixed: the management server crashed in some scenarios due to concurrency issues
Thank you, @ScottMcNaught, for the help with debugging the issues! :)
- OpenAI-compatible endpoint (`/v1/chat/completions`) is now properly balanced
- Fixed: the balancer's reverse proxy panicked in some scenarios when the underlying llama.cpp instance was abruptly closed during the generation of completion tokens
- Added a mutex in the targets collection for better internal slots data integrity
- Requests can queue when all llama.cpp instances are busy
- AWS Metadata support for agent local IP address
- StatsD metrics support
I initially wanted to use the Raft consensus algorithm (thus Paddler, because it paddles on a Raft), but eventually I dropped that idea. The name stayed, though.
Later, people started sending me a "that's a paddlin'" clip from The Simpsons, and I just embraced it.
Discord: https://discord.gg/kysUzFqSCK