SimAI

SimAI is the industry's first full-stack, high-precision simulator for AI large-scale training. It provides detailed modeling and simulation of the entire LLM training process, encompassing framework, collective communication, network layers, and more. This comprehensive approach offers end-to-end performance data, enabling researchers to analyze training process details, evaluate time consumption of AI tasks under specific conditions, and assess performance gains from various algorithmic optimizations.

README:

Latest News

🎯 Events & Community Engagement 🎯

📅 Upcoming Events

| Date | Event | Location | Content | Type |
|------|-------|----------|---------|------|
| Dec 27, 2024 | SimAI Technical Presentation | 📍 Beihang University | SimAI Technical Sharing & Discussion | 🎓 On-site |

🌟 Past Events

| Date | Event | Location | Content | Type |
|------|-------|----------|---------|------|
| Dec 6, 2024 | HKUST Technical Workshop | 📍 HKUST(GZ) | SimAI Technical Sharing & Discussion | 🎓 On-site |
| Dec 5, 2024 | Bench'24 Conference | 📍 Guangzhou | SimAI Tutorial & Deep-dive Session | 🎓 On-site |
| Nov 26, 2024 | SimAI Community Live Stream | 🌐 Online | Interactive Technical Discussion & Demo (400+ Attendees) | 💻 Virtual |
| Nov 15, 2024 | Technical Workshop | 📍 Thousand Island Lake | SimAI Offline Technical Exchange | 🎯 On-site |
| Oct 18, 2024 | Guest Lecture | 📍 Fudan University | SimAI Tutorial & Public Course | 🎓 On-site |
| Sept 24-26, 2024 | CCF HPC China 2024 | 📍 Wuhan | SimAI Introduction & Technical Presentation | 🎤 Conference |

SimAI Overview

Introduction

SimAI is the industry's first full-stack, high-precision simulator for large-scale AI training. It provides detailed modeling and simulation of the entire LLM training process, encompassing the framework, collective communication, network layers, and more. This comprehensive approach offers end-to-end performance data, enabling researchers to:

  • Analyze training process details
  • Evaluate the time consumption of AI tasks under specific conditions
  • Evaluate E2E performance gains from various algorithmic optimizations, including:
    • Framework parameter settings
    • Collective communication algorithms
    • NCCL environment variables
    • Network transmission protocols
    • Congestion control algorithms
    • Adaptive routing algorithms
    • Scale-up/out network topology modifications
    • ...

Components

        |--- AICB
SimAI --|--- SimCCL
        |--- astra-sim-alibabacloud
        |--- ns-3-alibabacloud

Building on pure simulation capabilities, SimAI has evolved into a versatile full-stack toolkit comprising four components (AICB, SimCCL, astra-sim-alibabacloud, ns-3-alibabacloud). These components can be combined in various ways to achieve different functionalities. Below, we present the six main usage scenarios for SimAI. We encourage users to explore further combinations.

Architecture diagram of the SimAI simulator: [image: SimAI_Arc]

astra-sim-alibabacloud is extended from astra-sim. We are grateful to the astra-sim team for their excellent work and open-source contribution. We have integrated NCCL algorithms and added some new features.

Scenario

SimAI supports three major operation modes to meet different simulation requirements:

SimAI-Analytical offers fast simulation by abstracting away network communication details and using bus bandwidth (busbw) to estimate collective communication time. It currently supports user-defined busbw; an automatic busbw calculation feature is coming soon.

SimAI-Simulation provides full-stack simulation with fine-grained network communication modeling. It leverages NS3 or other network simulators (only the NS3 backend is currently open-sourced) to simulate all communication behaviors in detail, aiming for high-fidelity reproduction of actual training environments.

SimAI-Physical (Beta) enables physical traffic generation in CPU RDMA cluster environments. This mode generates NCCL-like traffic patterns, allowing in-depth study of NIC behavior during LLM training. It is currently in an internal testing phase.
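To make the busbw abstraction used by SimAI-Analytical more concrete, here is a minimal sketch of how a collective's duration could be estimated from a user-supplied bus bandwidth. It assumes the busbw convention documented for NCCL's nccl-tests (for all-reduce, busbw = algbw × 2(n−1)/n) and ignores latency entirely. The function names and the example numbers are hypothetical and are not taken from SimAI's code; SimAI's own model may differ.

```python
def busbw_factor(collective: str, n_ranks: int) -> float:
    """Ratio busbw / algbw, following the nccl-tests convention (assumed here)."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least 2 ranks")
    if collective == "all_reduce":
        return 2.0 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "reduce_scatter", "all_to_all"):
        return (n_ranks - 1) / n_ranks
    raise ValueError(f"unsupported collective: {collective}")


def estimate_time_s(collective: str, size_bytes: float,
                    n_ranks: int, busbw_gb_s: float) -> float:
    """Latency-free estimate: time = size / algbw, with algbw derived from busbw.

    size_bytes is the message size as nccl-tests counts it; busbw_gb_s is in GB/s.
    """
    algbw_bytes_s = busbw_gb_s * 1e9 / busbw_factor(collective, n_ranks)
    return size_bytes / algbw_bytes_s


if __name__ == "__main__":
    # Illustrative numbers only: 1 GB all-reduce across 8 ranks at 100 GB/s busbw.
    t = estimate_time_s("all_reduce", 1e9, 8, 100.0)
    print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```

A model of this kind trades accuracy for speed: it has no notion of congestion or topology, which is the gap that the packet-level SimAI-Simulation mode is meant to cover.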

| Scenario | Description | Component Combination |
|----------|-------------|-----------------------|
| 1. AICB Test Suite | Run communication patterns on GPU clusters using the AICB test suite | AICB |
| 2. AICB/AIOB Workload | Model compute/communication patterns of the training process to generate a workload | AICB |
| 3. Collective Comm Analyze | Break down collective communication operations into point-to-point communication sets | SimCCL |
| 4. Collective Comm w/o GPU | Perform RDMA collective communication traffic on non-GPU clusters | AICB + SimCCL + astra-sim-alibabacloud(physical) |
| 5. SimAI-Analytical | Conduct rapid AICB workload analysis and simulation on any server (ignoring underlying network details) | AICB + astra-sim-alibabacloud(analytical) |
| 6. SimAI-Simulation | Perform full simulation on any server | AICB + SimCCL + astra-sim-alibabacloud(simulation) + ns-3-alibabacloud |
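As an illustration of what "breaking a collective into point-to-point communication sets" (scenario 3) means, the sketch below expands a textbook ring all-reduce into its individual sends. This is not SimCCL's implementation or API: the `P2P` record and `ring_all_reduce_p2p` function are hypothetical names, and SimCCL's actual algorithms may differ from this simple ring.

```python
from typing import List, NamedTuple


class P2P(NamedTuple):
    step: int    # global step index; sends in the same step happen concurrently
    src: int     # sending rank
    dst: int     # receiving rank
    chunk: int   # index of the data chunk being sent (buffer split into n chunks)
    phase: str   # "reduce_scatter" (receiver accumulates) or "all_gather" (receiver overwrites)


def ring_all_reduce_p2p(n_ranks: int) -> List[P2P]:
    """Expand a ring all-reduce over n_ranks into point-to-point sends."""
    if n_ranks < 2:
        raise ValueError("a ring needs at least 2 ranks")
    ops: List[P2P] = []
    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n_ranks - 1):
        for rank in range(n_ranks):
            chunk = (rank - step) % n_ranks
            ops.append(P2P(step, rank, (rank + 1) % n_ranks, chunk, "reduce_scatter"))
    # Phase 2: all-gather. Each rank forwards the reduced chunk it owns or
    # has just received, so every rank ends up with every reduced chunk.
    for step in range(n_ranks - 1):
        for rank in range(n_ranks):
            chunk = (rank + 1 - step) % n_ranks
            ops.append(P2P(n_ranks - 1 + step, rank, (rank + 1) % n_ranks, chunk, "all_gather"))
    return ops


if __name__ == "__main__":
    ops = ring_all_reduce_p2p(4)
    print(len(ops))  # 24 sends: 2 * n * (n - 1) for n = 4
    for op in ops[:4]:
        print(op)
```

A list of sends like this is the kind of input a network simulator can replay flow by flow, which is why the decomposition step sits between the workload and the network layers.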

Citation

SimAI has been accepted at NSDI'25 Spring. For more details, please refer to our paper:

SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.

[pdf] / [slides] / [video]

We encourage innovative research and extensions based on SimAI. You are welcome to join our community group or reach out via email for discussion. We may provide technical support.

Quick Start

Here are some simple examples. Full SimAI tutorials can be found here: SimAI@Tutorial, aicb@Tutorial, [SimCCL@Tutorial], [ns-3-alibabacloud@Tutorial]

Setup

Follow the instructions below to quickly set up the environment and run SimAI.

From Source Code

The following code has been tested successfully with GCC/G++ 9.4.0 and Python 3.8.10 on Ubuntu 20.04.

You can use the official Ubuntu 20.04 image; do not install ninja.

(For workload generation, we recommend using NGC container images directly.)

# Clone the repository
$ git clone https://github.com/aliyun/SimAI.git
$ cd ./SimAI/

# Clone submodules
$ git submodule update --init --recursive
# Make sure to use the newest commit
$ git submodule update --remote

# Compile SimAI-Analytical
$ ./scripts/build.sh -c analytical

# Compile SimAI-Simulation (ns3)
$ ./scripts/build.sh -c ns3

Use SimAI-Analytical

$  ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml

Use SimAI-Simulation

# Create network topo
$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_HPN_7.0_topo_mulgpus_one_link.py -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps

# Running
$ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./HPN_7_0_128_gpus_8_in_one_server_with_single_plane_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf

Contact us

Please email Gang Lu ([email protected]) or Qingxu Li ([email protected]) if you have any questions.

You are welcome to join the SimAI community chat groups: the DingTalk group is on the left and the WeChat group is on the right.

SimAI DingTalk SimAI WeChat
