surogate
Insanely fast LLM pre-training and fine-tuning for modern NVIDIA GPUs. Enterprise-grade LLMOps.
Stars: 83
Surogate is a high-performance, mixed-precision LLM pre-training and fine-tuning framework designed for practical hardware limits. It combines a native C++/CUDA execution engine, a low-overhead Python frontend, and a highly optimized multi-threaded scheduler to achieve industry-leading Speed-Of-Light (SOL) utilization on NVIDIA GPUs. Surogate Studio is the no-code companion for enterprise-grade LLMOps, offering features like pre-training, fine-tuning, native multi-GPU and multi-Node training, smart CPU offloading, pre-built training recipes, mixed-precision training, and adaptive training. It supports various NVIDIA GPUs and provides a unified platform for managing AI infrastructure and operations.
README:
LLM pre-training & fine-tuning at practical hardware limits
(C++/CUDA core, Python wrapper, BF16, FP8, NF4, NVFP4)
Surogate is an extremely fast, production-grade LLM training framework and LLMOps platform engineered to operate at practical hardware limits, delivering near–speed-of-light throughput, low-latency execution, and multi-GPU/multi-Node training at scale.
By combining a native C++/CUDA execution engine, a low-overhead Python frontend, and a highly optimized multi-threaded scheduler, Surogate achieves industry-leading Speed-Of-Light (SOL) utilization on NVIDIA GPUs — outperforming existing training toolkits by a wide margin.
See reproducible comparisons in Benchmarks.
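Speed-of-light (SOL) utilization compares achieved throughput against the theoretical peak of the hardware. As a point of reference for reading the benchmark numbers, here is a generic model-FLOPs-utilization estimate using the common 6·N FLOPs-per-token approximation; the formula and the example values are illustrative, not how Surogate reports its figures:

```python
# Generic model-FLOPs-utilization (MFU) estimate for a dense transformer.
# Assumptions: ~6 * n_params FLOPs per token for forward + backward,
# peak_flops taken from the GPU datasheet (e.g. BF16 tensor-core peak).

def mfu(n_params: float, tokens_per_second: float, peak_flops: float) -> float:
    achieved_flops = 6.0 * n_params * tokens_per_second
    return achieved_flops / peak_flops

# Example: a 0.6B-parameter model training at 50k tokens/s on a GPU with
# ~1e15 FLOP/s of BF16 peak sits at roughly 18% MFU.
print(f"{mfu(0.6e9, 50_000, 1e15):.1%}")
```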
Surogate Studio is the no-code companion of Surogate for enterprise-grade LLMOps.
Surogate is built for developers and enterprises that need fast experimentation, scalability, and predictable outcomes — whether running on-premise, in private clouds, or inside turnkey systems such as the DenseMAX Appliance.
- 🔧 Pre-training + Fine-tuning: full fine-tuning, LoRA/QLoRA
- 🖥️...🖥️ Native multi-GPU training with multi-threaded backend
- 🖥️...🖥️ Native multi-Node DDP training with Ray
- ⚡ Native C++/CUDA engine for near–Speed-Of-Light (SOL) throughput
- 🔥 Python DSL with AOT auto-differentiation for adding new model architectures
- ⚖️ Smart CPU Offloading for weights, gradients, activations, quants
- 📜 Pre-built training recipes:
  - 💎 BF16: Baseline recipe using bfloat16 for all GEMMs, designed for maximum numerical accuracy. No quantization is applied.
  - 🔥 FP8: Native FP8 training delivering extreme performance, with E4M3 used for activations and weights and E5M2 for gradients. Uses per-tensor delayed scaling to provide stable training (see the delayed-scaling sketch after this feature list).
  - 🔥 NVFP4: Native CUTLASS FP4 E2M1 training with two-level block scaling for extreme performance and memory efficiency on Blackwell GPUs (SM100+: B200, B300, RTX 50xx series). Uses stochastic rounding and random Hadamard transforms for numerical stability. Supports NVIDIA B200, B300, RTX 5070, 5080, 5090.
- ⚡ BnB/FP8/NVFP4 QLoRA Support for a variety of QLoRA configurations, including online quantization (FP8, NVFP4, BnB) or loading pre-quantized weights (FP8, NVFP4)
- 👌 Optimizers: AdamW 8-bit, NorMuon
- 🖥️ Runs on all NVIDIA GPUs: sm80, sm86, sm89, sm90, sm100, sm103, sm120, sm121
- 🧪 Mixed-precision training: Mix different dtypes for GEMMs, model, gradients and LoRA recipes to create your own flavor.
- 🛡️ Designed for reliability: deterministic configs, explicit recipes, and a clear C++ core
- 🧬 Adaptive Training: built-in automated training monitoring with automatic phase detection, multi-criteria early stopping (convergence, compute-efficiency, divergence, plateau), auto LR management, MoE imbalance detection, Chinchilla token budgeting and dynamic epoch adjustment
- 🎨 Dedicated MoE Features: Expert Parallelism, Least-Loaded EP load-balancing, MoE training metrics, Imbalance detection
- 🥞 Stacked LoRA training: Train a LoRA adapter on top of another LoRA adapter to skip offline merging into the base model (see the composition sketch after this list).
- 🧠 Supported models: Qwen2.5, Qwen3, Qwen3 MoE, Llama 3+, Nemotron Nano. Models can be added easily, please create a PR if you need a specific model.
- Surogate Studio: Unified no-code platform for managing your AI infrastructure and operations: Training, Fine-Tuning, Inference and Quantization
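The per-tensor delayed scaling mentioned in the FP8 recipe above derives each tensor's scale from a short history of absolute-maximum (amax) values rather than from the current tensor. A minimal numpy sketch of the idea follows; it is a generic illustration of the technique (the window length, margin, and fake cast are my assumptions), not Surogate's implementation:

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def delayed_scale(amax_history: list[float], margin: float = 0.0) -> float:
    """Scale for the current step, derived from past amax values (delayed scaling)."""
    amax = max(amax_history) if amax_history else 1.0
    return E4M3_MAX / (amax * 2.0 ** margin)

def fake_fp8_cast(x: np.ndarray, scale: float) -> np.ndarray:
    """Toy stand-in for an FP8 cast: scale into the E4M3 range, clamp, unscale.
    (A real cast would also round to the FP8 grid; that detail is omitted here.)"""
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX) / scale

history: list[float] = []
for step in range(3):
    x = np.random.randn(1024).astype(np.float32) * (step + 1)
    scale = delayed_scale(history[-16:])        # scale comes from *previous* steps
    y = fake_fp8_cast(x, scale)
    history.append(float(np.abs(x).max()))      # record this step's amax for later
```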
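For the stacked-LoRA feature above: composing adapters simply adds a second low-rank term to the effective weight, W_eff = W + (α1/r1)·B1·A1 + (α2/r2)·B2·A2, so a new adapter can be trained on top of a frozen one without merging anything offline. A small numpy sketch under the usual LoRA conventions (shapes, ranks, and scalings here are illustrative assumptions, not Surogate defaults):

```python
import numpy as np

d, r1, r2 = 64, 16, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))                                        # frozen base weight
A1 = rng.standard_normal((r1, d)); B1 = rng.standard_normal((d, r1))   # first adapter (frozen)
A2 = rng.standard_normal((r2, d)); B2 = np.zeros((d, r2))              # second adapter (trainable, B2 starts at 0)
alpha1, alpha2 = 32.0, 16.0

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight = base + both low-rank updates; the first adapter is never merged into W.
    W_eff = W + (alpha1 / r1) * B1 @ A1 + (alpha2 / r2) * B2 @ A2
    return x @ W_eff.T

x = rng.standard_normal((4, d))
print(forward(x).shape)  # (4, 64)
```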
Surogate provides three Docker images for various CUDA versions. Currently only the x86-64 architecture is supported.
| CUDA | Image | Recommended NVIDIA Driver | Minimum NVIDIA Driver |
|---|---|---|---|
| 12.8.1 | ghcr.io/invergent-ai/surogate:latest-cu128 | >= 570.124.06 | >= 525 |
| 12.9.1 | ghcr.io/invergent-ai/surogate:latest-cu129 | >= 575.57.08 | >= 525 |
| 13.1 | ghcr.io/invergent-ai/surogate:latest-cu13 | >= 590.48.01 | >= 580 |
Run with Docker:

docker run --gpus=all -v /my/local/config.yaml:/home/surogate/config.yaml -v /my/local/output_dir:<OUTPUT_DIR_FROM_CONFIG_YAML> <IMAGE> sft config.yaml

Or install natively with the install script:

curl -LsSf https://surogate.ai/install.sh | sh

You need CUDA 12.8/12.9/13.x installed on your machine and the NCCL development libraries (libnccl-dev) for your CUDA version.
# ...clone repo...
uv pip install -e .

- Create a config (example):
model: Qwen/Qwen3-0.6B
output_dir: ./output
# training
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
sequence_len: 2048
learning_rate: 2e-4
# LoRA / QLoRA
lora: true
lora_rank: 16
# qlora_fp8: true # optional, hardware-dependent
# qlora_fp4: true # Blackwell+
# qlora_bnb: true # Any GPU, lowest
datasets:
- path: "mlabonne/FineTome-100k"
type: auto- Run:
surogate sft config.yaml- Outputs:
- checkpoints, logs and artifacts are written under
output_dir
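With the example values above (per_device_train_batch_size: 2, gradient_accumulation_steps: 4, sequence_len: 2048), the effective batch works out as follows; this is generic arithmetic, not a Surogate-specific formula, and the single-GPU assumption is mine:

```python
# Effective batch size for the example config above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
sequence_len = 2048
num_gpus = 1  # assumption; multiply by the data-parallel world size for multi-GPU runs

samples_per_optimizer_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_optimizer_step = samples_per_optimizer_step * sequence_len

print(samples_per_optimizer_step)  # 8 sequences per optimizer step
print(tokens_per_optimizer_step)   # 16384 tokens per optimizer step
```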
- NVIDIA GPU + recent driver
- CUDA 12.8, 12.9, 13, NCCL, cuDNN
- Linux x86_64
- SM80: A100, A30
- SM86: A2, A16, A10, A40, RTX 3050, RTX 3060, RTX 3070, RTX 3080, RTX 3090, A2000, A3000, A4000, A5000, A6000
- SM89: L4, L40, L40S, RTX 4050, RTX 4060, RTX 4070, RTX 4080, RTX 4090, RTX 2000 Ada, RTX 4000 SFF Ada, RTX 4000 Ada, RTX 4500 Ada, RTX 5000 Ada, RTX 6000 Ada
- SM90: H100, H200, GH200
- SM100: B200, GB200
- SM103: B300, GB300
- SM120: RTX PRO 6000/5000/4000/2500/2000 Blackwell, RTX 5050, RTX 5060, RTX 5070, RTX 5080, RTX 5090
- SM121: DGX Spark
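If you are unsure which SM architecture your card reports, a quick way to check from Python (this uses plain PyTorch, not anything Surogate-specific):

```python
import torch

# Map the CUDA compute capability of device 0 to an SM architecture string,
# e.g. (8, 0) -> "SM80", (9, 0) -> "SM90", (12, 0) -> "SM120".
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM{major}{minor}")
```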
- Docs: https://docs.surogate.ai
- Examples: https://github.com/invergent-ai/surogate/tree/master/examples
Harness the full power of Surogate with Surogate Studio — a fully visual, easy-to-use platform for fine-tuning, deployment, evaluation, and production LLMOps.
PRs and issues are welcome. If you’re adding kernels/recipes or touching build/tooling, please keep changes minimal and include:
- a short description of the change,
- how to reproduce/validate locally (make test where applicable),
- and any GPU/arch assumptions.
Apache 2.0 — see LICENSE.