
aiconfigurator
Offline optimization of your disaggregated Dynamo graph
Stars: 68

The `aiconfigurator` tool helps you find a strong starting configuration for disaggregated serving in AI deployments. It optimizes throughput at a given latency by evaluating thousands of configurations based on the model, GPU count, and GPU type. The tool models LLM inference using data collected for a target machine and framework, and runs via both a CLI and a web app. It generates configuration files for deployment with Dynamo, and offers customized configuration, all-in-one automation, and advanced tuning features. Performance is estimated by breaking LLM inference down into operations, collecting operation execution times, and searching for strong configurations. Supported features include models such as GPT; operations such as attention, KV cache, GEMM, AllReduce, embedding, P2P, element-wise, MoE, and MLA BMM; multiple TRTLLM versions; and parallel modes such as tensor-parallel and pipeline-parallel.
README:
In disaggregated serving, configuring an effective deployment is challenging: you need to decide how many prefill and decode workers to run, and the parallelism for each worker. Combined with SLA targets for TTFT (Time to First Token) and TPOT (Time per Output Token), optimizing throughput at a given latency becomes even more complex.
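To anchor the units: TTFT is the latency to the first generated token, and TPOT is the average gap between subsequent tokens, so a TPOT target of 10 ms caps per-user generation speed at 1 / 0.010 s = 100 tokens/s/user, which is why the per-user throughputs in the example output below sit near that value.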
aiconfigurator helps you find a strong starting configuration for disaggregated serving. Given your model, GPU count, and GPU type, it searches the configuration space and generates configuration files you can use for deployment with Dynamo.
The tool models LLM inference using collected data for a target machine and framework. It evaluates thousands of configurations and runs anywhere via the CLI and the web app.
Let's get started.
# Option 1: install from PyPI
pip3 install aiconfigurator

# Option 2: install from source
# 1. Install Git LFS
apt-get install git-lfs # (Linux)
# brew install git-lfs # (macOS)
# 2. Clone the repo
git clone https://github.com/ai-dynamo/aiconfigurator.git
cd aiconfigurator
# 3. Create and activate a virtual environment (requires Python 3.9 or later)
python3 -m venv myenv && source myenv/bin/activate
# 4. Install aiconfigurator
pip3 install .

# Option 3: build the wheel with Docker
# This creates a ./dist/ folder containing the wheel file
docker build -f docker/Dockerfile --no-cache --target build -t aiconfigurator:latest .
docker create --name aic aiconfigurator:latest && docker cp aic:/workspace/dist dist/ && docker rm aic
aiconfigurator cli --model QWEN3_32B --total_gpus 512 --system h200_sxm
- With three basic arguments, it prints the estimated best deployment and the deployment details.
- Use --save_dir DIR to generate framework configuration files for Dynamo.
- Use -h for more options and customization.
********************************************************************************
* Dynamo AIConfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: QWEN3_32B (is_moe: False)
Total GPUs: 512
I/O Length (tokens): Input=4000, Output=500
SLA Target: TTFT <= 300.0ms, TPOT <= 10.0ms
----------------------------------------------------------------------------
Overall best system chosen: disagg at 812.48 tokens/s/gpu (2.39x better)
- Agg Actual Best: 340.48 tokens/s/gpu 100.83 tokens/s/user | TTFT: 188.91ms TPOT: 9.92ms
- Disagg Actual Best: 812.48 tokens/s/gpu 109.12 tokens/s/user | TTFT: 276.94ms TPOT: 9.16ms
----------------------------------------------------------------------------
Pareto Frontier:
QWEN3_32B Pareto Frontier: tokens/s/gpu vs tokens/s/user
┌────────────────────────────────────────────────────────────────────────┐
1600.0┤ dd Disagg │
│ aa Agg │
│ XX Best │
│ │
1333.3┤ a │
│ a │
│ aaaa d │
│ a ddddddddd │
1066.7┤ a dd │
│ aa dddddddd │
│ aaa dd │
│ a d │
800.0┤ a dddddddXdd │
│ aaaa d │
│ aaa d │
│ aa d │
533.3┤ aaaaaa dd │
│ aa dd │
│ aa dd │
│ aaaaaa ddd │
266.7┤ aaaaa d │
│ aaaaaaa │
│ aaaaaaa │
│ │
0.0┤ │
└┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
0 45 90 135 180
tokens/s/gpu tokens/s/user
----------------------------------------------------------------------------
Worker Setup:
Model: QWEN3_32B (is_moe: False)
Disagg Prefill: h200_sxm (trtllm)
Disagg Decode: h200_sxm (trtllm)
Prefill Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
Decode Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
Agg: h200_sxm (trtllm)
Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
----------------------------------------------------------------------------
Deployment Details:
(p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
Some math: total gpus used = replicas * gpus/replica
gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)
Disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | tokens/s/gpu | tokens/s/user | concurrency | total_gpus(used) | replicas | gpus/replica | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| 1 | 812.48 | 109.12 | 60 | 512 (512=64x8) | 64 | 8 (=4x1+1x4) | 4 | 1 (=1x1) | tp1pp1 | 1 | 1 | 4 (=4x1) | tp4pp1 | 60 |
| 2 | 802.97 | 100.56 | 204 | 512 (500=20x25) | 20 | 25 (=13x1+3x4) | 13 | 1 (=1x1) | tp1pp1 | 1 | 3 | 4 (=4x1) | tp4pp1 | 68 |
| 3 | 802.09 | 106.73 | 192 | 512 (500=20x25) | 20 | 25 (=13x1+3x4) | 13 | 1 (=1x1) | tp1pp1 | 1 | 3 | 4 (=4x1) | tp4pp1 | 64 |
| 4 | 767.19 | 114.22 | 156 | 512 (506=22x23) | 22 | 23 (=11x1+3x4) | 11 | 1 (=1x1) | tp1pp1 | 1 | 3 | 4 (=4x1) | tp4pp1 | 52 |
| 5 | 761.70 | 111.61 | 224 | 512 (496=16x31) | 16 | 31 (=15x1+4x4) | 15 | 1 (=1x1) | tp1pp1 | 1 | 4 | 4 (=4x1) | tp4pp1 | 56 |
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
Agg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
| Rank | tokens/s/gpu | tokens/s/user | concurrency | total_gpus(used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
| 1 | 340.48 | 100.83 | 15 | 512 (512=128x4) | 128 | 4 | 4 (=4x1) | tp4pp1 | 15 |
| 2 | 326.78 | 104.48 | 14 | 512 (512=128x4) | 128 | 4 | 4 (=4x1) | tp4pp1 | 14 |
| 3 | 307.50 | 105.57 | 13 | 512 (512=128x4) | 128 | 4 | 4 (=4x1) | tp4pp1 | 13 |
| 4 | 296.61 | 107.15 | 24 | 512 (512=64x8) | 64 | 8 | 8 (=8x1) | tp8pp1 | 24 |
| 5 | 265.44 | 115.81 | 20 | 512 (512=64x8) | 64 | 8 | 8 (=8x1) | tp8pp1 | 20 |
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
********************************************************************************
INFO 2025-07-28 17:23:10,701 main.py:1035] Configuration completed in 48.18 seconds
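The "Pareto Frontier" plot in the output charts each candidate deployment's tokens/s/gpu against tokens/s/user; the final pick comes from the non-dominated set. As a minimal illustration of the underlying idea (a sketch, not aiconfigurator's actual code), the Python below filters candidate (tokens/s/user, tokens/s/gpu) points down to the frontier:

```python
# Sketch: compute a Pareto frontier over candidate configurations.
# A point is kept if no other point is at least as good on both axes.
def pareto_frontier(points):
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)

# (tokens/s/user, tokens/s/gpu) pairs taken from the tables above.
configs = [(100.83, 340.48), (109.12, 812.48), (104.48, 326.78), (114.22, 767.19)]
print(pareto_frontier(configs))  # [(109.12, 812.48), (114.22, 767.19)]
```

The two surviving points trade per-user speed against per-GPU throughput; dominated configurations are discarded.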
These results indicate that deploying Qwen3-32B on h200_sxm in FP8 can achieve 2.39x higher tokens/s/gpu for disaggregated versus aggregated deployment under the SLA targets TTFT ≤ 300 ms and TPOT ≤ 10 ms, with an input:output sequence length (ISL:OSL) of 4000:500.
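To make the table's accounting concrete, read the Rank 1 disagg row: each replica pairs 4 prefill workers of 1 GPU each (tp1pp1) with 1 decode worker of 4 GPUs (tp4pp1), so gpus/replica = 4×1 + 1×4 = 8, and 64 replicas use 64 × 8 = 512 of the 512 available GPUs, matching the "512 (512=64x8)" entry.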
Try different ISL:OSL values and SLA limits to fit your use case, for example:
aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 200 --tpot 10 --isl 8000 --osl 200
You will get different results.
To customize further (including the search space and per-component quantization), parameters are defined in a YAML file. Built-in YAML files live under src/aiconfigurator/cli/templates/trtllm/xxx_default.yaml (in the future, trtllm can be replaced by other backend names). Refer to the YAML file and modify as needed, then pass your customized YAML file with --yaml_path:
aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 200 --tpot 10 --isl 8000 --osl 200 --yaml_path customized_config.yaml
For guidance on tuning these parameters, refer to Advanced Tuning.
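If you prefer to derive the customized file programmatically, a minimal sketch follows; the template file name and the key it touches are illustrative assumptions, not the tool's actual schema, so check the bundled templates first:

```python
import yaml  # pip install pyyaml

# Assumed template name; real files follow the xxx_default.yaml pattern above.
template = "src/aiconfigurator/cli/templates/trtllm/qwen3_32b_default.yaml"
with open(template) as f:
    cfg = yaml.safe_load(f)

# Hypothetical key: narrow the tensor-parallel sizes explored by the search.
cfg["search_space"] = {"tp": [2, 4]}

with open("customized_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```

The resulting customized_config.yaml is what the --yaml_path example above passes in.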
In the aiconfigurator CLI, if you specify --save_dir, the tool generates configuration files for deploying with Dynamo. This feature bridges the gap between configuration and Dynamo deployment. The folder structure looks like this:
backend_configs/
├── agg/
│   ├── agg_config.yaml
│   └── node_0_run.sh
└── disagg/
    ├── decode_config.yaml
    ├── prefill_config.yaml
    ├── node_0_run.sh
    ├── node_1_run.sh
    └── ...
Refer to the Deployment Guide for details.
To further simplify the end-to-end experience, we now support automating everything in one script: configuring the deployment, generating the configs, preparing the Docker image and container, pulling model checkpoints, deploying the service, benchmarking, and summarizing the results.
bash launch_eval.sh config.env
Everything runs from one command! We're working to integrate our expertise to make deployment smarter. Refer to Automation for more details.
aiconfigurator webapp
Visit http://127.0.0.1:7860.
Refer to Advanced Tuning and the webapp README tab before running experiments.
There are many features, such as different quantizations and parallelism strategies, to tune performance beyond the default configurations. These apply to both the CLI and the webapp. Refer to Advanced Tuning for details.
LLM inference performance is dominated by:
- Compute cost (such as GEMM and attention).
- Communication cost (such as all-reduce for tensor parallel and P2P for pipeline parallel).
To estimate performance, we take the following steps:
- Break down LLM inference into operations: GEMM, attention, communication, embedding, element-wise operations, and others.
- Collect operation execution times on the target hardware.
- Estimate end-to-end execution time for a configuration by composing operation times using interpolation and extrapolation.
- Model in-flight batching (aggregated) and disaggregated serving on top of that.
- Search thousands of combinations to find strong configurations and generate Dynamo configuration files based on the results.
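To make steps 2 and 3 above concrete, here is a minimal sketch of the composition idea, with assumed operation names and made-up timing numbers rather than aiconfigurator's actual database:

```python
import numpy as np

# Made-up measured latencies (ms) on a coarse grid of batch sizes; the real
# database is collected per operation, per GPU, and per framework version.
measured_bs = np.array([1, 8, 32, 128, 512])
gemm_ms = np.array([0.021, 0.024, 0.055, 0.19, 0.72])
attn_ms = np.array([0.015, 0.018, 0.041, 0.15, 0.58])

def op_time(table_ms, batch_size):
    """Interpolate a latency table to the requested batch size.
    (np.interp clamps at the grid edges; true extrapolation needs more care.)"""
    return float(np.interp(batch_size, measured_bs, table_ms))

def layer_time(batch_size, allreduce_ms=0.012):
    """Per-layer decode-step estimate: compute ops plus communication."""
    return op_time(gemm_ms, batch_size) + op_time(attn_ms, batch_size) + allreduce_ms

# End-to-end estimate for a hypothetical 64-layer model at decode batch size 60.
print(f"estimated decode step: {64 * layer_time(60):.2f} ms")
```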
- Models:
  - GPT
  - LLAMA (2, 3)
  - MOE
  - QWEN
  - DEEPSEEK_V3
- Operations:
  - Attention
    - MHA/GQA (FP8, FP16)
    - MLA (FP8, FP16)
  - KV Cache (FP16, FP8, INT8)
  - GEMM (FP16, FP8, FP8-Block, FP8-OOTB, SQ, INT8 WO, INT4 WO, NVFP4)
  - AllReduce (FP16)
  - Embedding
  - P2P
  - ElementWise
  - NCCL (all_reduce, all_gather, all-to-all, reduce_scatter)
  - MoE (FP16, FP8, FP8-Block, W4A-FP8, INT4 WO, NVFP4)
  - MLA BMM (FP16, FP8)
- TRTLLM versions:
  - 0.20.0
  - 1.0.0rc3
- Parallel modes:
  - Tensor-parallel
  - Pipeline-parallel
  - Expert tensor-parallel/expert-parallel
  - Attention DP (for DEEPSEEK and MoE)
- Scheduling:
  - Static
  - IFB (continuous batching)
  - Disaggregated serving
- MTP (for DEEPSEEK)
Data collection is a standalone process for building the database used by aiconfigurator. By default, you do not need to collect data yourself.
Small changes to the database may not materially change performance estimates. For example, you can use 1.0.0rc3 data of trtllm on h200_sxm and deploy the generated configuration with Dynamo and a trtllm 1.0.0rc4 worker.
To go through the process, refer to the guidance under the collector folder.
| System | Framework (Version) | Status |
|---|---|---|
| h100_sxm | TRTLLM (0.20.0, 1.0.0rc3) | ✅ |
| h200_sxm | TRTLLM (0.20.0, 1.0.0rc3) | ✅ |
| b200_sxm | TRTLLM (NA) | 🚧 |
Adding a new model requires modifying the source code and possibly collecting new data for the model. Please refer to add_a_new_model.
Known limitations:
- MoE memory estimation for the trtllm backend needs to consider workspace.
- Results can be overly optimistic in the low-speed, high-throughput region.
Note: the results are not final or absolute. They may be inaccurate due to modeling gaps, or they may point to genuine performance-improvement opportunities. The tool aims to align with the framework's current implementation and to provide configuration suggestions. Verify the results in real benchmarks with the generated configurations and perform follow-up tuning.
Similar Open Source Tools


langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).

nncf
Neural Network Compression Framework (NNCF) provides a suite of post-training and training-time algorithms for optimizing inference of neural networks in OpenVINO™ with a minimal accuracy drop. It is designed to work with models from PyTorch, TorchFX, TensorFlow, ONNX, and OpenVINO™. NNCF offers samples demonstrating compression algorithms for various use cases and models, with the ability to add different compression algorithms easily. It supports GPU-accelerated layers, distributed training, and seamless combination of pruning, sparsity, and quantization algorithms. NNCF allows exporting compressed models to ONNX or TensorFlow formats for use with OpenVINO™ toolkit, and supports Accuracy-Aware model training pipelines via Adaptive Compression Level Training and Early Exit Training.

LitServe
LitServe is a high-throughput serving engine designed for deploying AI models at scale. It generates an API endpoint for models, handles batching, streaming, and autoscaling across CPU/GPUs. LitServe is built for enterprise scale with a focus on minimal, hackable code-base without bloat. It supports various model types like LLMs, vision, time-series, and works with frameworks like PyTorch, JAX, Tensorflow, and more. The tool allows users to focus on model performance rather than serving boilerplate, providing full control and flexibility.

DeepRetrieval
DeepRetrieval is a tool designed to enhance search engines and retrievers using Large Language Models (LLMs) and Reinforcement Learning (RL). It allows LLMs to learn how to search effectively by integrating with search engine APIs and customizing reward functions. The tool provides functionalities for data preparation, training, evaluation, and monitoring search performance. DeepRetrieval aims to improve information retrieval tasks by leveraging advanced AI techniques.

IDvs.MoRec
This repository contains the source code for the SIGIR 2023 paper 'Where to Go Next for Recommender Systems? ID- vs. Modality-based Recommender Models Revisited'. It provides resources for evaluating foundation, transferable, multi-modal, and LLM recommendation models, along with datasets, pre-trained models, and training strategies for IDRec and MoRec using in-batch debiased cross-entropy loss. The repository also offers large-scale datasets, code for SASRec with in-batch debias cross-entropy loss, and information on joining the lab for research opportunities.

llm4ad
LLM4AD is an open-source Python-based platform leveraging Large Language Models (LLMs) for Automatic Algorithm Design (AD). It provides unified interfaces for methods, tasks, and LLMs, along with features like evaluation acceleration, secure evaluation, logs, GUI support, and more. The platform was originally developed for optimization tasks but is versatile enough to be used in other areas such as machine learning, science discovery, game theory, and engineering design. It offers various search methods and algorithm design tasks across different domains. LLM4AD supports remote LLM API, local HuggingFace LLM deployment, and custom LLM interfaces. The project is licensed under the MIT License and welcomes contributions, collaborations, and issue reports.

KwaiAgents
KwaiAgents is a series of Agent-related works open-sourced by the [KwaiKEG](https://github.com/KwaiKEG) from [Kuaishou Technology](https://www.kuaishou.com/en). The open-sourced content includes: 1. **KAgentSys-Lite**: a lite version of the KAgentSys in the paper. While retaining some of the original system's functionality, KAgentSys-Lite has certain differences and limitations when compared to its full-featured counterpart, such as: (1) a more limited set of tools; (2) a lack of memory mechanisms; (3) slightly reduced performance capabilities; and (4) a different codebase, as it evolves from open-source projects like BabyAGI and Auto-GPT. Despite these modifications, KAgentSys-Lite still delivers comparable performance among numerous open-source Agent systems available. 2. **KAgentLMs**: a series of large language models with agent capabilities such as planning, reflection, and tool-use, acquired through the Meta-agent tuning proposed in the paper. 3. **KAgentInstruct**: over 200k Agent-related instructions finetuning data (partially human-edited) proposed in the paper. 4. **KAgentBench**: over 3,000 human-edited, automated evaluation data for testing Agent capabilities, with evaluation dimensions including planning, tool-use, reflection, concluding, and profiling.

airswap-protocols
AirSwap Protocols is a repository containing smart contracts for developers and traders on the AirSwap peer-to-peer trading network. It includes various packages for functionalities like server registry, atomic token swap, staking, rewards pool, batch token and order calls, libraries, and utils. The repository follows a branching and release process for contracts and tools, with steps for regular development process and individual package features or patches. Users can deploy and verify contracts using specific commands with network flags.

LocalAI
LocalAI is a free and open-source OpenAI alternative that acts as a drop-in replacement REST API compatible with OpenAI (Elevenlabs, Anthropic, etc.) API specifications for local AI inferencing. It allows users to run LLMs, generate images, audio, and more locally or on-premises with consumer-grade hardware, supporting multiple model families and not requiring a GPU. LocalAI offers features such as text generation with GPTs, text-to-audio, audio-to-text transcription, image generation with stable diffusion, OpenAI functions, embeddings generation for vector databases, constrained grammars, downloading models directly from Huggingface, and a Vision API. It provides a detailed step-by-step introduction in its Getting Started guide and supports community integrations such as custom containers, WebUIs, model galleries, and various bots for Discord, Slack, and Telegram. LocalAI also offers resources like an LLM fine-tuning guide, instructions for local building and Kubernetes installation, projects integrating LocalAI, and a how-tos section curated by the community. It encourages users to cite the repository when utilizing it in downstream projects and acknowledges the contributions of various software from the community.

auto-dev
AutoDev is an AI-powered coding wizard that supports multiple languages, including Java, Kotlin, JavaScript/TypeScript, Rust, Python, Golang, C/C++/OC, and more. It offers a range of features, including auto development mode, copilot mode, chat with AI, customization options, SDLC support, custom AI agent integration, and language features such as language support, extensions, and a DevIns language for AI agent development. AutoDev is designed to assist developers with tasks such as auto code generation, bug detection, code explanation, exception tracing, commit message generation, code review content generation, smart refactoring, Dockerfile generation, CI/CD config file generation, and custom shell/command generation. It also provides a built-in LLM fine-tune model and supports UnitEval for LLM result evaluation and UnitGen for code-LLM fine-tune data generation.

Liger-Kernel
Liger Kernel is a collection of Triton kernels designed for LLM training, increasing training throughput by 20% and reducing memory usage by 60%. It includes Hugging Face Compatible modules like RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy. The tool works with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed, aiming to enhance model efficiency and performance for researchers, ML practitioners, and curious novices.

llama-gpt
LlamaGPT is a self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. It is 100% private, with no data leaving your device. It supports various models, including Nous Hermes Llama 2 7B/13B/70B Chat and Code Llama 7B/13B/34B Chat. You can install LlamaGPT on your umbrelOS home server, M1/M2 Mac, or anywhere else with Docker. It also provides an OpenAI-compatible API for easy integration. LlamaGPT is still under development, with plans to add more features such as custom model loading and model switching.

RobustVLM
This repository contains code for the paper 'Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models'. It focuses on fine-tuning CLIP in an unsupervised manner to enhance its robustness against visual adversarial attacks. By replacing the vision encoder of large vision-language models with the fine-tuned CLIP models, it achieves state-of-the-art adversarial robustness on various vision-language tasks. The repository provides adversarially fine-tuned ViT-L/14 CLIP models and offers insights into zero-shot classification settings and clean accuracy improvements.

BitBLAS
BitBLAS is a library for mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, the $W_{INT2}A_{INT8}$ in BitNet-b1.58. BitBLAS is based on techniques from our accepted submission at OSDI'24.

EasyEdit
EasyEdit is a Python package for editing Large Language Models (LLMs) such as `GPT-J`, `Llama`, `GPT-NEO`, `GPT2`, and `T5` (supporting models from **1B** to **65B**). Its objective is to alter the behavior of LLMs efficiently within a specific domain without negatively impacting performance on other inputs. It is designed to be easy to use and easy to extend.
For similar tasks

SuperCoder
SuperCoder is an open-source autonomous software development system that leverages advanced AI tools and agents to streamline and automate coding, testing, and deployment tasks, enhancing efficiency and reliability. It supports a variety of languages and frameworks for diverse development needs. Users can set up the environment variables, build and run the Go server, Asynq worker, and Postgres using Docker and Docker Compose. The project is under active development and may still have issues, but users can seek help and support from the Discord community or by creating new issues on GitHub.


ai-toolbox
AI-Toolbox is a collection of automation scripts and tools designed to streamline AI workflows. It simplifies the installation process of various AI applications, making software deployment effortless for data scientists, researchers, and developers. The toolbox offers automated installation of multiple applications, customization for specific workflows, easy-to-use scripts, and receives regular updates and contributions from the community.

aimp-discord-presence
AIMP - Discord Presence is a plugin for AIMP that changes the status of Discord based on the music you are listening to. It allows users to share their detected activity with others on Discord. The plugin settings are stored in the AIMP configuration file, and users can customize various options such as application ID, timestamp, album art display, and image settings for different playback states.

ChatDev
ChatDev is a virtual software company powered by intelligent agents like CEO, CPO, CTO, programmer, reviewer, tester, and art designer. These agents collaborate to revolutionize the digital world through programming. The platform offers an easy-to-use, highly customizable, and extendable framework based on large language models, ideal for studying collective intelligence. ChatDev introduces innovative methods like Iterative Experience Refinement and Experiential Co-Learning to enhance software development efficiency. It supports features like incremental development, Docker integration, Git mode, and Human-Agent-Interaction mode. Users can customize ChatChain, Phase, and Role settings, and share their software creations easily. The project is open-source under the Apache 2.0 License and utilizes data licensed under CC BY-NC 4.0.

novelai-bot
This repository contains a drawing plugin based on NovelAI. It allows users to draw images, change models, samplers, and image sizes, use advanced request syntax, customize prohibited word lists, automatically translate Chinese keywords, automatically retract messages after a certain time, and connect to private servers. Thanks to Koishi's plugin mechanism, users can achieve more functionalities by combining it with other plugins, such as multi-platform support, rate limiting, context management, and multi-language support.

WaveRobloxExecutor
Wave Executor is a cutting-edge script executor tailored for Roblox enthusiasts, offering AI integration for seamless script development, ad-free premium features, and 24/7 customer support. It enhances your Roblox gameplay experience by providing a wide range of features to take your gameplay to new heights.

gpt_mobile
GPT Mobile is a chat assistant for Android that allows users to chat with multiple models at once. It supports various platforms such as OpenAI GPT, Anthropic Claude, and Google Gemini. Users can customize temperature, top p (Nucleus sampling), and system prompt. The app features local chat history, Material You style UI, dark mode support, and per app language setting for Android 13+. It is built using 100% Kotlin, Jetpack Compose, and follows a modern app architecture for Android developers.
For similar jobs


sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students

uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.