
Surogate

LLM pre-training & fine-tuning at practical hardware limits
(C++/CUDA core, Python wrapper, BF16, FP8, NF4, NVFP4)

Home · Docs · Examples · Benchmarks · Studio


What is Surogate?

Surogate is an extremely fast, production-grade LLM training framework and LLMOps platform engineered to operate at practical hardware limits, delivering near speed-of-light throughput, low-latency execution, and scalable multi-GPU/multi-Node training.

By combining a native C++/CUDA execution engine, a low-overhead Python frontend, and a highly optimized multi-threaded scheduler, Surogate achieves industry-leading Speed-Of-Light (SOL) utilization on NVIDIA GPUs — outperforming existing training toolkits by a wide margin.

See reproducible comparisons in Benchmarks.

Surogate Studio is Surogate's no-code companion for enterprise-grade LLMOps.



✨ Highlights

Surogate is built for developers and enterprises that need fast experimentation, scalability, and predictable outcomes, whether running on-premises, in private clouds, or inside turnkey systems such as the DenseMAX Appliance.

  • 🔧 Pre-training + Fine-tuning: full fine-tuning, LoRA/QLoRA
  • 🖥️...🖥️ Native multi-GPU training with multi-threaded backend
  • 🖥️...🖥️ Native multi-Node DDP training with Ray
  • ⚡ Native C++/CUDA engine for near–Speed-Of-Light (SOL) throughput
  • 🔥 Python DSL with AOT auto-differentiation for adding new model architectures
  • ⚖️ Smart CPU Offloading for weights, gradients, activations, quants
  • 📜 Pre-built training recipes:
    • 💎 BF16: Baseline recipe using bfloat16 for all GEMMs, designed for maximum numerical accuracy. No quantization is applied.
    • 🔥 FP8: Native FP8 training delivering extreme performance, with E4M3 used for activations and weights and E5M2 for gradients. Uses per-tensor delayed scaling for stable training.
    • 🔥 NVFP4: Native CUTLASS FP4 (E2M1) training with two-level block scaling for extreme performance and memory efficiency on Blackwell GPUs (SM100+: B200, B300, RTX 50xx series, including RTX 5070/5080/5090). Uses stochastic rounding and random Hadamard transforms for numerical stability.
  • ⚡ BnB/FP8/NVFP4 QLoRA: support for a variety of QLoRA configurations, including online quantization (FP8, NVFP4, BnB) and loading pre-quantized weights (FP8, NVFP4)
  • 👌 Optimizers: AdamW 8-bit and NorMuon
  • 🖥️ Runs on all NVIDIA GPUs: sm80, sm86, sm89, sm90, sm100, sm103, sm120, sm121
  • 🧪 Mixed-precision training: mix different dtypes across GEMMs, model weights, gradients, and LoRA recipes to create your own flavor.
  • 🛡️ Designed for reliability: deterministic configs, explicit recipes, and a clear C++ core
  • 🧬 Adaptive Training: built-in automated training monitoring with automatic phase detection, multi-criteria early stopping (convergence, compute-efficiency, divergence, plateau), auto LR management, MoE imbalance detection, Chinchilla token budgeting and dynamic epoch adjustment
  • 🎨 Dedicated MoE Features: Expert Parallelism, Least-Loaded EP load-balancing, MoE training metrics, Imbalance detection
  • 🥞 Stacked LoRA training: train a LoRA adapter on top of another LoRA adapter, skipping the offline merge into the base model.
  • 🧠 Supported models: Qwen2.5, Qwen3, Qwen3 MoE, Llama 3+, Nemotron Nano. Models can be added easily; please create a PR if you need a specific model.
  • Surogate Studio: a unified no-code platform for managing your AI infrastructure and operations (training, fine-tuning, inference, and quantization)

Quickstart

Option A: Run using Docker (recommended)

Surogate provides three Docker images for different CUDA versions. Currently, only the x86-64 architecture is supported.

CUDA     Image                                        Recommended NVIDIA Driver   Minimum NVIDIA Driver
12.8.1   ghcr.io/invergent-ai/surogate:latest-cu128   >= 570.124.06               >= 525
12.9.1   ghcr.io/invergent-ai/surogate:latest-cu129   >= 575.57.08                >= 525
13.1     ghcr.io/invergent-ai/surogate:latest-cu13    >= 590.48.01                >= 580

docker run --gpus=all \
  -v /my/local/config.yaml:/home/surogate/config.yaml \
  -v /my/local/output_dir:<OUTPUT_DIR_FROM_CONFIG_YAML> \
  <IMAGE> sft config.yaml
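
For example, with the CUDA 12.8 image, assuming your config.yaml sets output_dir to /home/surogate/output (an assumption; substitute whatever path your config actually specifies):

docker run --gpus=all \
  -v $PWD/config.yaml:/home/surogate/config.yaml \
  -v $PWD/output:/home/surogate/output \
  ghcr.io/invergent-ai/surogate:latest-cu128 sft config.yaml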

Option B: Install via script

curl -LsSf https://surogate.ai/install.sh | sh

Option C: Build from source (dev / contributors)

You need CUDA 12.8, 12.9, or 13.x installed on your machine, plus the NCCL development libraries (libnccl-dev) matching your CUDA version.
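
On Debian/Ubuntu with NVIDIA's CUDA repository configured, the NCCL packages can typically be installed like this (package names are an assumption; consult NVIDIA's NCCL install docs for your distribution):

sudo apt install libnccl2 libnccl-dev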

# ...clone repo...
uv pip install -e .

Quickstart (SFT)

  1. Create a config (example):
model: Qwen/Qwen3-0.6B
output_dir: ./output

# training
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
sequence_len: 2048
learning_rate: 2e-4

# LoRA / QLoRA
lora: true
lora_rank: 16
# qlora_fp8: true  # optional, hardware-dependent
# qlora_fp4: true  # Blackwell+
# qlora_bnb: true  # any GPU, lowest memory use

datasets:
  - path: "mlabonne/FineTome-100k"
    type: auto
  2. Run:
surogate sft config.yaml
  3. Outputs:
  • checkpoints, logs and artifacts are written under output_dir
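
A QLoRA run with online BnB quantization is a small change to the same config; a minimal sketch reusing only the keys shown above (adjust model and dataset to taste):

model: Qwen/Qwen3-0.6B
output_dir: ./output

per_device_train_batch_size: 2
gradient_accumulation_steps: 4
sequence_len: 2048
learning_rate: 2e-4

lora: true
lora_rank: 16
qlora_bnb: true  # online BnB quantization; works on any supported GPU

datasets:
  - path: "mlabonne/FineTome-100k"
    type: auto

It runs with the same command as before: surogate sft config.yaml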

Hardware / Requirements

  • NVIDIA GPU + recent driver
  • CUDA 12.8, 12.9, or 13.x, plus NCCL and cuDNN
  • Linux x86_64
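
A quick way to confirm your setup is nvidia-smi, which prints the installed driver version and the highest CUDA version that driver supports; compare it against the driver table in the Docker section above:

nvidia-smi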

Supported NVIDIA GPUs:

  • SM80: A100, A30
  • SM86: A2, A16, A10, A40, RTX 3050, RTX 3060, RTX 3070, RTX 3080, RTX 3090, A2000, A3000, A4000, A5000, A6000
  • SM89: L4, L40, L40S, RTX 4050, RTX 4060, RTX 4070, RTX 4080, RTX 4090, RTX 2000 Ada, RTX 4000 SFF Ada, RTX 4000 Ada, RTX 4500 Ada, RTX 5000 Ada, RTX 6000 Ada
  • SM90: H100, H200, GH200
  • SM100: B200, GB200
  • SM103: B300, GB300
  • SM120: RTX PRO 6000/5000/4000/2500/2000 Blackwell, RTX 5050, RTX 5060, RTX 5070, RTX 5080, RTX 5090
  • SM121: DGX Spark

Documentation / Examples


Harness the full power of Surogate with Surogate Studio — a fully visual, easy-to-use platform for fine-tuning, deployment, evaluation, and production LLMOps.


Contributing

PRs and issues are welcome. If you’re adding kernels/recipes or touching build/tooling, please keep changes minimal and include:

  • a short description of the change,
  • how to reproduce/validate locally (make test where applicable),
  • and any GPU/arch assumptions.

License

Apache 2.0 — see LICENSE.
