
AI Performance Engineering

About This Repository

This repository contains comprehensive code examples, tools, and resources for AI Systems Performance Engineering. It accompanies the O'Reilly book covering GPU optimization, distributed training, inference scaling, and performance tuning for modern AI workloads.

O'Reilly Book

O'Reilly Book - Fall 2025
Available on Amazon

AI Systems Performance Engineering Book Checklist

The book includes a comprehensive 175+ item performance checklist covering:

  • ✅ Performance Tuning Mindset and Cost Optimization
  • ✅ Reproducibility and Documentation Best Practices
  • ✅ System Architecture and Hardware Planning
  • ✅ Operating System and Driver Optimizations
  • ✅ GPU Programming and CUDA Tuning
  • ✅ Distributed Training and Network Optimization
  • ✅ Efficient Inference and Serving
  • ✅ Power and Thermal Management
  • ✅ Latest Profiling Tools and Techniques
  • ✅ Architecture-Specific Optimizations

Built in San Francisco for the AI performance engineering community


Key Focus Areas

  • GPU Architecture, PyTorch, CUDA, and OpenAI Triton Programming
  • Distributed Training & Inference
  • Memory Optimization & Profiling
  • PyTorch Performance Tuning
  • Multi-Node Scaling Strategies

Quick Start

Prerequisites

  • NVIDIA GPU with CUDA support
  • Python 3.8+
  • PyTorch with CUDA
  • Docker (optional)

Getting Started

# Clone the repository
git clone https://github.com/your-repo/ai-performance-engineering.git
cd ai-performance-engineering

# Install dependencies for a specific chapter
cd code/ch1
pip install -r requirements.txt

# Run examples
python performance_basics.py
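
If you just want to confirm that PyTorch sees the GPU before working through a chapter, a minimal timing script along the lines below is enough. This is a sketch for illustration only and is not necessarily what the repository's performance_basics.py contains.

# sanity_check.py - hypothetical minimal GPU timing example (not the repo's performance_basics.py)
import torch

assert torch.cuda.is_available(), "an NVIDIA GPU with CUDA support is required"
device = torch.device("cuda")

# Keep the matmul modest so it finishes in milliseconds
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

# Warm up once so one-time kernel/cache costs do not skew the timing
torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print(f"2048x2048 matmul: {start.elapsed_time(end):.2f} ms")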

Profiling-Friendly Workloads

Most examples use modest tensor sizes and short iteration counts so Nsight and the PyTorch profiler finish in seconds. Comments inside each script highlight these adjustments; increase the sizes if you need larger-scale numbers.
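
As a rough illustration of that pattern, the sketch below (a generic example, not taken from a specific chapter script) profiles a few short iterations with the PyTorch profiler so the resulting trace stays small:

# profile_small.py - hypothetical profiler-friendly workload
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")      # modest batch size on purpose

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):                       # short iteration count keeps the trace small
        y = model(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Increase the sizes and iteration counts above if you need larger-scale numbers.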

Blackwell Workflow

This repository now targets a single architecture profile: NVIDIA Blackwell B200/B300 (SM100). All tooling, CUDA builds, and PyTorch examples assume CUDA 12.9, PyTorch 2.9 nightlies, and Triton 3.4. Use the helper scripts to stay aligned with that stack:

# Build CUDA samples and run sanity checks
./code/build_all.sh

# Profile the entire codebase with Nsight + PyTorch profiler
python code/profiler_scripts/profile_harness.py --profile nsys --profile pytorch --output-root profiles/full_run

# Reset all generated profiling artifacts
./clean_profiles.sh

For hardware details and optimization notes, see code/README.md.
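
Before running the helper scripts, it can help to confirm that your environment actually matches that stack. The check below is only a sketch; the expected version strings and the SM100 capability value are assumptions based on the profile described above.

# check_stack.py - hedged sanity check for the Blackwell (SM100) profile
import torch

print("PyTorch:", torch.__version__)              # expected: a 2.9 nightly
print("CUDA (torch build):", torch.version.cuda)  # expected: 12.9

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")     # Blackwell B200/B300 reports 10.x (SM100)

try:
    import triton
    print("Triton:", triton.__version__)          # expected: 3.4
except ImportError:
    print("Triton not installed")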

Latest Features

Updated for PyTorch 2.9, CUDA 12.9, and Triton 3.4:

  • PyTorch 2.9: Enhanced compiler, dynamic shapes, improved profiler (see the torch.compile sketch after this list)
  • CUDA 12.9: Latest CUDA features, improved kernel performance
  • Triton 3.4: Latest Triton optimizations, architecture-specific kernels
  • Enhanced Profiling: Nsight Systems 2024.1, Nsight Compute 2024.1
  • HTA: Holistic Trace Analysis for multi-GPU systems
  • Perf: Enhanced system-level analysis
  • Architecture Optimizations: Blackwell-specific features
  • Unified Profiling Harness: One command walks through every chapter with Nsight Systems/Compute + PyTorch profiler
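
The sketch below illustrates the compiler and dynamic-shapes items above. It is a generic torch.compile usage pattern, not code taken from this repository:

# compile_dynamic.py - generic torch.compile example with dynamic shapes (illustrative only)
import torch

def fused_op(x):
    return torch.relu(x @ x.T) * 2.0

# dynamic=True asks the compiler to generate kernels that tolerate varying batch sizes
compiled = torch.compile(fused_op, dynamic=True)

for batch in (8, 32, 128):                    # different shapes reuse the dynamically shaped kernels
    x = torch.randn(batch, 256, device="cuda")
    y = compiled(x)
    print(batch, tuple(y.shape))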

Book Chapters Overview

Chapter 1: Introduction and AI System Overview

  • The AI Systems Performance Engineer
  • Benchmarking and Profiling
  • Scaling Distributed Training and Inference
  • Managing Resources Efficiently
  • Cross-Team Collaboration
  • Transparency and Reproducibility

Chapter 2: AI System Hardware Overview

  • The CPU and GPU "Superchip"
  • NVIDIA Grace CPU & Blackwell GPU
  • NVIDIA GPU Tensor Cores and Transformer Engine
  • Streaming Multiprocessors, Threads, and Warps
  • Ultra-Scale Networking
  • NVLink and NVSwitch
  • Multi-GPU Programming

Chapter 3: OS, Docker, and Kubernetes Tuning

  • Operating System Configuration
  • GPU Driver and Software Stack
  • NUMA Awareness and CPU Pinning
  • Container Runtime Optimizations
  • Kubernetes for Topology-Aware Orchestration
  • Memory Isolation and Resource Management

Chapter 4: Tuning Distributed Networking Communication

  • Overlapping Communication and Computation
  • NCCL for Distributed Multi-GPU Communication
  • Topology Awareness in NCCL
  • Distributed Data Parallel Strategies
  • NVIDIA Inference Transfer Library (NIXL)
  • In-Network SHARP Aggregation

Chapter 5: GPU-based Storage I/O Optimizations

  • Fast Storage and Data Locality
  • NVIDIA GPUDirect Storage
  • Distributed, Parallel File Systems
  • Multi-Modal Data Processing with NVIDIA DALI
  • Creating High-Quality LLM Datasets

Chapter 6: GPU Architecture, CUDA Programming, and Maximizing Occupancy

  • Understanding GPU Architecture
  • Threads, Warps, Blocks, and Grids
  • CUDA Programming Refresher
  • Understanding GPU Memory Hierarchy
  • Maintaining High Occupancy and GPU Utilization
  • Roofline Model Analysis

Chapter 7: Profiling and Tuning GPU Memory Access Patterns

  • Coalesced vs. Uncoalesced Global Memory Access
  • Vectorized Memory Access
  • Tiling and Data Reuse Using Shared Memory
  • Warp Shuffle Intrinsics
  • Asynchronous Memory Prefetching

Chapter 8: Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism

  • Profiling and Diagnosing GPU Bottlenecks
  • Nsight Systems and Compute Analysis
  • Tuning Occupancy
  • Improving Warp Execution Efficiency
  • Exposing Instruction-Level Parallelism

Chapter 9: Increasing CUDA Kernel Efficiency and Arithmetic Intensity

  • Multi-Level Micro-Tiling
  • Kernel Fusion
  • Mixed Precision and Tensor Cores
  • Using CUTLASS for Optimal Performance
  • Inline PTX and SASS Tuning

Chapter 10: Intra-Kernel Pipelining and Cooperative Thread Block Clusters

  • Intra-Kernel Pipelining Techniques
  • Warp-Specialized Producer-Consumer Model
  • Persistent Kernels and Megakernels
  • Thread Block Clusters and Distributed Shared Memory
  • Cooperative Groups

Chapter 11: Inter-Kernel Pipelining and CUDA Streams

  • Using Streams to Overlap Compute with Data Transfers
  • Stream-Ordered Memory Allocator
  • Fine-Grained Synchronization with Events
  • Zero-Overhead Launch with CUDA Graphs

Chapter 12: Dynamic and Device-Side Kernel Orchestration

  • Dynamic Scheduling with Atomic Work Queues
  • Batch Repeated Kernel Launches with CUDA Graphs
  • Dynamic Parallelism
  • Orchestrate Across Multiple GPUs with NVSHMEM

Chapter 13: Profiling, Tuning, and Scaling PyTorch

  • NVTX Markers and Profiling Tools
  • PyTorch Compiler (torch.compile)
  • Profiling and Tuning Memory in PyTorch
  • Scaling with PyTorch Distributed
  • Multi-GPU Profiling with HTA

Chapter 14: PyTorch Compiler, XLA, and OpenAI Triton Backends

  • PyTorch Compiler Deep Dive
  • Writing Custom Kernels with OpenAI Triton
  • PyTorch XLA Backend
  • Advanced Triton Kernel Implementations

Chapter 15: Multi-Node Inference Parallelism and Routing

  • Disaggregated Prefill and Decode Architecture
  • Parallelism Strategies for MoE Models
  • Speculative and Parallel Decoding Techniques
  • Dynamic Routing Strategies

Chapter 16: Profiling, Debugging, and Tuning Inference at Scale

  • Workflow for Profiling and Tuning Performance
  • Dynamic Request Batching and Scheduling
  • Systems-Level Optimizations
  • Quantization Approaches for Real-Time Inference
  • Application-Level Optimizations

Chapter 17: Scaling Disaggregated Prefill and Decode

  • Prefill-Decode Disaggregation Benefits
  • Prefill Workers Design
  • Decode Workers Design
  • Disaggregated Routing and Scheduling Policies
  • Scalability Considerations

Chapter 18: Advanced Prefill-Decode and KV Cache Tuning

  • Optimized Decode Kernels (FlashMLA, ThunderMLA, FlexDecoding)
  • Tuning KV Cache Utilization and Management
  • Heterogeneous Hardware and Parallelism Strategies
  • SLO-Aware Request Management

Chapter 19: Dynamic and Adaptive Inference Engine Optimizations

  • Adaptive Parallelism Strategies
  • Dynamic Precision Changes
  • Kernel Auto-Tuning
  • Reinforcement Learning Agents for Runtime Tuning
  • Adaptive Batching and Scheduling

Chapter 20: AI-Assisted Performance Optimizations

  • AlphaTensor AI-Discovered Algorithms
  • Automated GPU Kernel Optimizations
  • Self-Improving AI Agents
  • Scaling Toward Multi-Million GPU Clusters

Tools and Utilities

Profiling Scripts

  • code/profiler_scripts/profile_harness.py - Unified Nsight Systems / Nsight Compute / PyTorch profiler runner
  • code/profiler_scripts/enhanced_profiling.sh - Convenience wrapper for individual scripts
  • code/profiler_scripts/hta_profile.sh - Holistic Trace Analysis

Performance Analysis Tools

  • tools/comprehensive_profiling.py - Python-based profiling utilities
  • tools/compare_nsight/ - Nsight Systems comparison tools
  • tools/inference_gpu_cluster_sizing/ - Cluster sizing notebooks

Enhanced Profiling Commands

# Comprehensive profiling
nsys profile -t cuda,nvtx,osrt,triton -o timeline_profile python script.py

# Kernel analysis
ncu --metrics achieved_occupancy,warp_execution_efficiency -o kernel_profile python script.py

# HTA for multi-GPU
nsys profile -t cuda,nvtx,osrt,cudnn,cublas,nccl,triton -o hta_profile python script.py

# System analysis
perf record -g -p $(pgrep python) -o perf.data
perf report -i perf.data
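
The nsys commands above record NVTX ranges only if the application emits them. A minimal way to add ranges from PyTorch (a generic pattern, not specific to this repository) looks like this:

# nvtx_ranges.py - annotate a region so it shows up on the Nsight Systems timeline
import torch

x = torch.randn(4096, 4096, device="cuda")

torch.cuda.nvtx.range_push("matmul_block")    # open a named NVTX range
for _ in range(5):
    y = x @ x
torch.cuda.nvtx.range_pop()                   # close it
torch.cuda.synchronize()

# Capture it with, for example:
#   nsys profile -t cuda,nvtx -o nvtx_demo python nvtx_ranges.py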

Community Resources

Monthly Meetups (100,000+ Global Members, 20+ Cities)

Recent Meetups

  • Sept 15, 2025
  • Aug 18, 2025
  • July 21, 2025
  • June 16, 2025
  • May 19, 2025
  • April 21, 2025


Contributing

We welcome contributions! Please see our Contributing Guide for:

  • Code examples and improvements
  • Documentation updates
  • Performance optimization techniques
  • Bug reports and feature requests

License

This project is licensed under the MIT License - see the LICENSE file for details.

