
tt-forge
Tenstorrent's MLIR-based compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware through an open-source, general, and performant compiler.
Stars: 104

TT-Forge is Tenstorrent's MLIR-based compiler. It integrates with compiler technologies from various AI/ML frameworks to enable both running models and generating custom kernels. It aims to abstract the different frontend frameworks, compile various model architectures with good performance, and abstract all Tenstorrent device architectures. The repository serves as the central hub for the tt-forge compiler project, bringing together its sub-projects into a cohesive product with releases, demos, model support, roadmaps, and key resources. Users can explore the documentation for the individual front ends to get started running tests and demos.
README:
TT-Forge is Tenstorrent's MLIR-based compiler. It integrates with various compiler technologies and AI/ML frameworks to enable both running models and generating custom kernels.
- Getting Started / How to Run a Model
- Interactive Tenstorrent Software Diagram
- TT-Forge-FE
- TT-Torch
- TT-XLA
- TT-MLIR
- TT-Metal
- TT-TVM
This repository is the central hub for the tt-forge compiler project, bringing together its various sub-projects into a cohesive product. Here, you'll find releases, demos, model support, roadmaps, and other key resources as the project evolves. Please file any issues with questions or feedback you may have here.
See the documentation available for individual front ends in the Front End section to get started running some tests. You can also try running a demo using the TT-Forge Getting Started page.
Project goals:
- Provide abstraction of many different frontend frameworks
- Generically compile many kinds of model architectures without modification and with good performance
- Abstract all Tenstorrent device architectures
tt-forge is composed of various projects, ranging from front ends that support popular third-party AI frameworks to the MLIR compiler project, performance optimizations, and supporting tools. tt-forge lowers to our tt-metalium project, providing additional functionality to our AI software ecosystem.
Overview of Tenstorrent's open-source AI software ecosystem. Click on components to navigate to their repositories:
flowchart TD
%% Define styles for the diagram with improved contrast and font size
classDef frameworks fill:#f9d6d2,stroke:#e05d44,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef frontends fill:#fff3cd,stroke:#ffc107,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef compiler fill:#d1e7dd,stroke:#198754,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef runtime fill:#cfe2ff,stroke:#0d6efd,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef system fill:#e2e3e5,stroke:#6c757d,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef software fill:#d3d3ff,stroke:#6610f2,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef hardware fill:#f8f9fa,stroke:#212529,stroke-width:2px,color:#000000,font-size:14px,font-weight:bold
classDef invisible opacity:0,fill:none,stroke:none
%% Top level layout with invisible container to center frameworks
subgraph TopLevel[" "]
direction LR
%% Left spacer (invisible)
LeftSpacer[" "]:::invisible
%% Center container for frameworks
subgraph FrameworksContainer[" "]
direction TB
%% Top level frameworks
subgraph Frameworks["<span style='font-size:16px;font-weight:bold'>Frameworks</span>"]
direction LR
JAX("<span style='font-size:14px;font-weight:bold'>JAX</span>")
ONX("<span style='font-size:14px;font-weight:bold'>ONNX</span>")
PYTORCH("<span style='font-size:14px;font-weight:bold'>PyTorch</span>")
TF("<span style='font-size:14px;font-weight:bold'>TensorFlow</span>")
end
%% Front-ends
subgraph FrontEnds["<span style='font-size:16px;font-weight:bold'>Front Ends</span>"]
direction LR
%% Add extra spacing between frontend components
TT_TORCH("<span style='font-size:14px;font-weight:bold'>tt-torch</span>")
TT_FORGE_FE("<span style='font-size:14px;font-weight:bold'>tt-forge-FE</span>")
TT_XLA("<span style='font-size:14px;font-weight:bold'>tt-xla</span>")
end
end
%% Right spacer (invisible)
RightSpacer[" "]:::invisible
end
%% Style invisible containers
TopLevel:::invisible
FrameworksContainer:::invisible
%% Compiler sections side by side
subgraph CompilerLayer["<span style='font-size:16px;font-weight:bold'>Compiler Layer</span>"]
%% tt-MLIR Compiler section
subgraph TTMLIR["<span style='font-size:16px;font-weight:bold'>tt-MLIR Compiler</span>"]
TTIR("<span style='font-size:14px;font-weight:bold'>TT-IR</span>")
STABLEHLO("<span style='font-size:14px;font-weight:bold'>StableHLO-IR</span>")
PYKERNEL("<span style='font-size:14px;font-weight:bold'>PyKernel</span>")
%% Graph Passes - using hexagon shape
GRAPH_PASSES{{"<span style='font-size:14px;font-weight:bold'>Graph Passes</span>"}}
TTMETAL_IR("<span style='font-size:14px;font-weight:bold'>TT-Metal-IR</span>")
TTNN("<span style='font-size:14px;font-weight:bold'>TTNN-IR</span>")
TTKERNEL("<span style='font-size:14px;font-weight:bold'>TTKernel-IR</span>")
%% Connect PyKernel to Graph Passes
PYKERNEL --> GRAPH_PASSES
%% Connect Graph Passes to IRs
GRAPH_PASSES --> TTKERNEL
GRAPH_PASSES --> TTNN
GRAPH_PASSES --> TTMETAL_IR
end
%% Compiler Tools section with vertical layout
subgraph CompilerTools["<span style='font-size:16px;font-weight:bold'>Compiler Tools</span>"]
direction TB
TTMLIROPT("<span style='font-size:14px;font-weight:bold'>ttmlir-opt</span>")
TTNNSTANDALONE("<span style='font-size:14px;font-weight:bold'>ttnn-standalone</span>")
TTEXPLORER("<span style='font-size:14px;font-weight:bold'>tt-explorer</span>")
end
end
%% Set direction for compiler sections to be side by side
CompilerLayer:::none
TTMLIR --- CompilerTools
%% TT-Metalium section with Tools
subgraph MetaliumLayer["<span style='font-size:16px;font-weight:bold'>Metalium Layer</span>"]
%% TT-Metalium section
subgraph TTMETALIUM["<span style='font-size:16px;font-weight:bold'>TT-Metalium</span>"]
TTNN_HW("<span style='font-size:14px;font-weight:bold'>TTNN</span>")
TTMETAL("<span style='font-size:14px;font-weight:bold'>TTMetal</span>")
%% Connect TTNN to TTMetal within TT-Metalium
TTNN_HW --> TTMETAL
end
%% Metalium Tools section with vertical layout
subgraph MetaliumTools["<span style='font-size:16px;font-weight:bold'>Metalium Tools</span>"]
direction TB
TRACY("<span style='font-size:14px;font-weight:bold'>tracy</span>")
TTNPE("<span style='font-size:14px;font-weight:bold'>tt-npe</span>")
TTNNVISUALIZER("<span style='font-size:14px;font-weight:bold'>ttnn-visualizer</span>")
end
end
%% Set direction for Metalium sections to be side by side
MetaliumLayer:::none
TTMETALIUM --- MetaliumTools
%% LLK outside of TT-Metalium
LLK("<span style='font-size:14px;font-weight:bold'>LLK</span>")
%% System Tools and System Software sections side by side
subgraph SystemLayer["<span style='font-size:16px;font-weight:bold'>System Layer</span>"]
%% System Tools section
subgraph SystemTools["<span style='font-size:16px;font-weight:bold'>System Tools</span>"]
TTSMI("<span style='font-size:14px;font-weight:bold'>tt-smi</span>")
LUWEN("<span style='font-size:14px;font-weight:bold'>luwen</span>")
TTTOPOLOGY("<span style='font-size:14px;font-weight:bold'>tt-topology</span>")
end
%% System Software section
subgraph SystemSoftware["<span style='font-size:16px;font-weight:bold'>System Software</span>"]
UMD("<span style='font-size:14px;font-weight:bold'>UMD</span>")
KMD("<span style='font-size:14px;font-weight:bold'>KMD</span>")
end
end
%% Set direction for system sections to be side by side
SystemLayer:::none
%% Hardware section
subgraph Hardware["<span style='font-size:16px;font-weight:bold'>Hardware</span>"]
WORMHOLE("<span style='font-size:14px;font-weight:bold'>Wormhole</span>")
BLACKHOLE("<span style='font-size:14px;font-weight:bold'>Blackhole</span>")
end
%% Connect TTMetal to LLK, LLK to System Software, and System Layer to Hardware
TTMETAL --> LLK
LLK --> SystemSoftware
SystemLayer --> Hardware
%% Connect frameworks to front-ends with longer arrows
ONX -.-> TT_TORCH
ONX -.-> TT_FORGE_FE
JAX -.-> TT_XLA
PYTORCH -.-> TT_TORCH
PYTORCH -.-> TT_FORGE_FE
TF -.-> TT_FORGE_FE
%% Connect front-ends to tt-MLIR Compiler
TT_TORCH --> STABLEHLO
TT_XLA --> STABLEHLO
TT_FORGE_FE --> TTIR
%% Connect tt-MLIR Compiler components
STABLEHLO --> TTIR
TTIR --> GRAPH_PASSES
%% Connect IRs to hardware
TTNN --> TTNN_HW
TTMETAL_IR --> TTMETAL
TTKERNEL --> TTMETALIUM
%% Apply styles
class ONX,JAX,PYTORCH,TF frameworks
class TT_TORCH,TT_XLA,TT_FORGE_FE frontends
class TTIR,TTKERNEL,TTNN,TTMETAL_IR,GRAPH_PASSES,PYKERNEL,TTMLIROPT,TTNNSTANDALONE,TTEXPLORER compiler
class TTMETAL,TTNN_HW,LLK,TRACY,TTNPE,TTNNVISUALIZER runtime
class TTSMI,LUWEN,TTTOPOLOGY system
class UMD,KMD software
class WORMHOLE,BLACKHOLE hardware
classDef none opacity:0,fill:none,stroke:none
class LeftSpacer,RightSpacer,TopLevel,FrameworksContainer invisible
%% Add clickable URLs to frontend components
click TT_XLA "https://github.com/tenstorrent/tt-xla" "tt-xla GitHub Repository" _blank
click TT_TORCH "https://github.com/tenstorrent/tt-torch" "tt-torch GitHub Repository" _blank
click TT_FORGE_FE "https://github.com/tenstorrent/tt-forge-fe" "tt-forge-fe GitHub Repository" _blank
%% Add clickable URLs to IR components
click TTKERNEL "https://github.com/tenstorrent/tt-mlir/tree/main/lib/Dialect/TTKernel/IR" "TTKernel-IR GitHub Repository" _blank
click TTIR "https://github.com/tenstorrent/tt-mlir/tree/main/lib/Dialect/TTIR/IR" "TT-IR GitHub Repository" _blank
click TTMETAL_IR "https://github.com/tenstorrent/tt-mlir/tree/main/lib/Dialect/TTMetal/IR" "TT-Metal-IR GitHub Repository" _blank
click TTNN "https://github.com/tenstorrent/tt-mlir/tree/main/lib/Dialect/TTNN/IR" "TTNN-IR GitHub Repository" _blank
click PYKERNEL "https://github.com/tenstorrent/tt-mlir/tree/main/python/pykernel" "PyKernel GitHub Repository" _blank
click STABLEHLO "https://openxla.org/stablehlo/spec" "StableHLO Specification" _blank
%% Add clickable URLs to System Software components
click UMD "https://github.com/tenstorrent/tt-umd" "UMD GitHub Repository" _blank
click KMD "https://github.com/tenstorrent/tt-kmd" "KMD GitHub Repository" _blank
%% Add clickable URLs to System Tools components
click TTSMI "https://github.com/tenstorrent/tt-smi" "tt-smi GitHub Repository" _blank
click LUWEN "https://github.com/tenstorrent/luwen" "luwen GitHub Repository" _blank
click TTTOPOLOGY "https://github.com/tenstorrent/tt-topology" "tt-topology GitHub Repository" _blank
%% Add clickable URLs to TT-Metalium components
click TTMETAL "https://github.com/tenstorrent/tt-metal" "TTMetal GitHub Repository" _blank
click TTNN_HW "https://github.com/tenstorrent/tt-metal/tree/main/ttnn" "TTNN GitHub Repository" _blank
click LLK "https://github.com/tenstorrent/tt-llk" "LLK GitHub Repository" _blank
%% Add clickable URLs to Metalium Tools components
click TRACY "https://github.com/tenstorrent/tt-metal/tree/main/ttnn/tracy" "tracy GitHub Repository" _blank
click TTNPE "https://github.com/tenstorrent/tt-npe" "tt-npe GitHub Repository" _blank
click TTNNVISUALIZER "https://github.com/tenstorrent/ttnn-visualizer" "ttnn-visualizer GitHub Repository" _blank
%% Add clickable URLs to Compiler Tools components
click TTEXPLORER "https://github.com/tenstorrent/tt-mlir/tree/main/tools/explorer" "tt-explorer GitHub Repository" _blank
click TTNNSTANDALONE "https://github.com/tenstorrent/tt-mlir/tree/main/tools/ttnn-standalone" "ttnn-standalone GitHub Repository" _blank
click TTMLIROPT "https://github.com/tenstorrent/tt-mlir/tree/main/tools/ttmlir-opt" "ttmlir-opt GitHub Repository" _blank
%% Add clickable URLs to Hardware components
click WORMHOLE "https://tenstorrent.com/hardware/wormhole" "Wormhole Hardware Product Page" _blank
click BLACKHOLE "https://tenstorrent.com/hardware/blackhole" "Blackhole Hardware Product Page" _blank
TT-Forge-FE
- A TVM-based graph compiler designed to optimize and transform computational graphs for deep learning models. Supports ingestion of PyTorch, ONNX, TensorFlow, PaddlePaddle, and similar ML frameworks via TVM (TT-TVM).
- See the TT-Forge-FE docs pages for an overview and getting started guide.
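As a rough illustration of the TT-Forge-FE flow, here is a minimal sketch of compiling a small PyTorch module. It assumes the `forge` package from tt-forge-fe exposes a `forge.compile` entry point taking sample inputs; treat the exact API as an assumption and consult the TT-Forge-FE docs for the real interface.

```python
# Minimal sketch, assuming tt-forge-fe's `forge` package and a
# `forge.compile(module, sample_inputs=...)` entry point (an assumption;
# see the TT-Forge-FE docs for the actual API).
import torch
import forge

class Add(torch.nn.Module):
    def forward(self, a, b):
        return a + b

model = Add()
inputs = [torch.rand(1, 32), torch.rand(1, 32)]

# Compilation ingests the module via TVM and lowers it through TT-MLIR.
compiled = forge.compile(model, sample_inputs=inputs)
out = compiled(*inputs)
```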
TT-Torch
- An MLIR-native, open-source front end based on PyTorch 2.X and torch-mlir. It provides StableHLO (SHLO) graphs to TT-MLIR. Supports ingestion of PyTorch models via PT2.X compile and of ONNX models via torch-mlir (ONNX->SHLO).
- See the TT-Torch docs pages for an overview and getting started guide.
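Since tt-torch plugs into the PT2.X compile path, usage should look roughly like registering it as a `torch.compile` backend. The backend import path below is an assumption for illustration; the TT-Torch docs have the actual entry point.

```python
# Minimal sketch, assuming tt-torch exposes a torch.compile backend at
# tt_torch.dynamo.backend (an assumed import path; see the TT-Torch docs).
import torch
from tt_torch.dynamo.backend import backend  # assumed import path

model = torch.nn.Linear(32, 32).eval()
x = torch.rand(1, 32)

# torch.compile traces the model, and the tt-torch backend lowers the
# captured graph to StableHLO for the TT-MLIR compiler.
compiled = torch.compile(model, backend=backend)
with torch.no_grad():
    y = compiled(x)
```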
TT-XLA
- Leverages a PJRT interface to integrate JAX (and, in the future, other frameworks), TT-MLIR, and Tenstorrent hardware. Supports ingestion of JAX models via jit compile, providing a StableHLO (SHLO) graph to the TT-MLIR compiler.
- See the TT-XLA docs pages for an overview and getting started guide.
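Because tt-xla hooks in through PJRT, a jitted JAX function should run on Tenstorrent hardware without model changes once the plugin is installed. The platform name below is an assumption for illustration; the TT-XLA docs describe the actual setup.

```python
# Minimal sketch, assuming the tt-xla PJRT plugin is installed and
# registers a "tt" platform with JAX (an assumption; see the TT-XLA docs).
import jax
import jax.numpy as jnp

jax.config.update("jax_platforms", "tt")  # assumed platform name

@jax.jit
def scaled_sum(a, b):
    return jnp.sum(a * 2.0 + b)

# jit tracing produces a StableHLO graph, which PJRT hands to TT-MLIR.
a = jnp.ones((32, 32))
b = jnp.ones((32, 32))
print(scaled_sum(a, b))
```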
TT-MLIR Project
At its core, TT-MLIR is our compiler, interfacing with tt-metalium, our open-source low-level AI hardware SDK. TT-MLIR provides a solution for optimizing machine learning and other compute workloads for all Tenstorrent hardware, bridging the gap between the different ML frameworks and Tenstorrent hardware. TT-MLIR is broken into different dialects:
- TTIR Dialect: our common IR, which can then be lowered into multiple different backends.
- TTNN Dialect: our entry point into the TTNN library of ops.
- TTMetalium Dialect: our entry point into directly accessing tt-metalium kernels.
The compiler employs various optimization passes, including layout transformation, operation fusing, decomposition, and sharding, ensuring efficient lowering to the target dialect.
The project also provides several tools:
- ttmlir-opt: runs the TT-MLIR compiler passes on .mlir source files and is central to developing and testing the compiler (see the sketch after this list).
- ttmlir-translate: allows us to ingest something (e.g., code) into the MLIR compiler and produce something (e.g., an executable binary, or even code again) from it.
- ttrt: a standalone runtime tool that can inspect and run compiler executable files without a front end.
- tt-explorer: provides a "Human-In-Loop" interface so that compiler results can be actively tuned and understood by the person compiling the model.
- ttnn-standalone: a post-compile tuning/debugging tool for C++ TTNN generated code.
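As a rough sketch of where ttmlir-opt fits in a workflow, the following drives it from Python over a .mlir file. The pass-pipeline flag shown is an assumption for illustration; `ttmlir-opt --help` lists the pipelines actually available.

```python
# Minimal sketch: running a TT-MLIR pass pipeline with ttmlir-opt.
# The pipeline flag below is an assumption for illustration; list the
# real options with `ttmlir-opt --help`.
import subprocess

result = subprocess.run(
    [
        "ttmlir-opt",
        "--ttir-to-ttnn-backend-pipeline",  # assumed pipeline name
        "model.mlir",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # the module lowered toward TTNN-IR
```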
This repo is a part of Tenstorrent’s bounty program. If you are interested in helping to improve TT-Forge, please make sure to read the Tenstorrent Bounty Program Terms and Conditions before heading to the issues tab. Look for the issues that are tagged with both “bounty” and difficulty level!
Similar Open Source Tools

mcp-fundamentals
The mcp-fundamentals repository is a collection of fundamental concepts and examples for the Model Context Protocol (MCP), covering how to build MCP servers that connect AI applications to external tools and data. The repository provides hands-on exercises and code samples to help users understand and apply these concepts in real-world scenarios, whether you are a beginner learning the basics or an experienced professional refreshing your knowledge.

truss
Truss is a tool that simplifies the process of serving AI/ML models in production. It provides a consistent and easy-to-use interface for packaging, testing, and deploying models, regardless of the framework they were created with. Truss also includes a live reload server for fast feedback during development, and a batteries-included model serving environment that eliminates the need for Docker and Kubernetes configuration.

sdk-python
Strands Agents is a lightweight and flexible SDK that takes a model-driven approach to building and running AI agents. It supports various model providers, offers advanced capabilities like multi-agent systems and streaming support, and comes with built-in MCP server support. Users can easily create tools using Python decorators, integrate MCP servers seamlessly, and leverage multiple model providers for different AI tasks. The SDK is designed to scale from simple conversational assistants to complex autonomous workflows, making it suitable for a wide range of AI development needs.

ml-retreat
ML-Retreat is a comprehensive machine learning library designed to simplify and streamline the process of building and deploying machine learning models. It provides a wide range of tools and utilities for data preprocessing, model training, evaluation, and deployment. With ML-Retreat, users can easily experiment with different algorithms, hyperparameters, and feature engineering techniques to optimize their models. The library is built with a focus on scalability, performance, and ease of use, making it suitable for both beginners and experienced machine learning practitioners.

langflow
Langflow is an open-source Python-powered visual framework designed for building multi-agent and RAG applications. It is fully customizable, language model agnostic, and vector store agnostic. Users can easily create flows by dragging components onto the canvas, connect them, and export the flow as a JSON file. Langflow also provides a command-line interface (CLI) for easy management and configuration, allowing users to customize the behavior of Langflow for development or specialized deployment scenarios. The tool can be deployed on various platforms such as Google Cloud Platform, Railway, and Render. Contributors are welcome to enhance the project on GitHub by following the contributing guidelines.

RLinf
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models via reinforcement learning. It provides a robust backbone for next-generation training, supporting open-ended learning, continuous generalization, and limitless possibilities in intelligence development. The tool offers unique features like Macro-to-Micro Flow, flexible execution modes, auto-scheduling strategy, embodied agent support, and fast adaptation for mainstream VLA models. RLinf is fast with hybrid mode and automatic online scaling strategy, achieving significant throughput improvement and efficiency. It is also flexible and easy to use with multiple backend integrations, adaptive communication, and built-in support for popular RL methods. The roadmap includes system-level enhancements and application-level extensions to support various training scenarios and models. Users can get started with complete documentation, quickstart guides, key design principles, example gallery, advanced features, and guidelines for extending the framework. Contributions are welcome, and users are encouraged to cite the GitHub repository and acknowledge the broader open-source community.

dspy.rb
DSPy.rb is a Ruby framework for building reliable LLM applications using composable, type-safe modules. It enables developers to define typed signatures and compose them into pipelines, offering a more structured approach compared to traditional prompting. The framework embraces Ruby conventions and adds innovations like CodeAct agents and enhanced production instrumentation, resulting in scalable LLM applications that are robust and efficient. DSPy.rb is actively developed, with a focus on stability and real-world feedback through the 0.x series before reaching a stable v1.0 API.

motia
Motia is an AI agent framework designed for software engineers to create, test, and deploy production-ready AI agents quickly. It provides a code-first approach, allowing developers to write agent logic in familiar languages and visualize execution in real-time. With Motia, developers can focus on business logic rather than infrastructure, offering zero infrastructure headaches, multi-language support, composable steps, built-in observability, instant APIs, and full control over AI logic. Ideal for building sophisticated agents and intelligent automations, Motia's event-driven architecture and modular steps enable the creation of GenAI-powered workflows, decision-making systems, and data processing pipelines.

Memento
Memento is a lightweight and user-friendly version control tool designed for small to medium-sized projects. It provides a simple and intuitive interface for managing project versions and collaborating with team members. With Memento, users can easily track changes, revert to previous versions, and merge different branches. The tool is suitable for developers, designers, content creators, and other professionals who need a streamlined version control solution. Memento simplifies the process of managing project history and ensures that team members are always working on the latest version of the project.

mlx-lm
MLX LM is a Python package designed for generating text and fine-tuning large language models on Apple silicon using MLX. It offers integration with the Hugging Face Hub for easy access to thousands of LLMs, support for quantizing and uploading models to the Hub, low-rank and full model fine-tuning capabilities, and distributed inference and fine-tuning with `mx.distributed`. Users can interact with the package through command line options or the Python API, enabling tasks such as text generation, chatting with language models, model conversion, streaming generation, and sampling. MLX LM supports various Hugging Face models and provides tools for efficient scaling to long prompts and generations, including a rotating key-value cache and prompt caching. It requires macOS 15.0 or higher for optimal performance.

LightLLM
LightLLM is a Python-based framework for LLM inference and serving, notable for its lightweight design, easy scalability, and high-speed performance. It incorporates techniques such as fine-grained KV cache management (TokenAttention) and dynamic batching to serve large language models with high throughput on GPUs.

superlinked
Superlinked is a compute framework for information retrieval and feature engineering systems, focusing on converting complex data into vector embeddings for RAG, Search, RecSys, and Analytics stack integration. It enables custom model performance in machine learning with pre-trained model convenience. The tool allows users to build multimodal vectors, define weights at query time, and avoid postprocessing & rerank requirements. Users can explore the computational model through simple scripts and python notebooks, with a future release planned for production usage with built-in data infra and vector database integrations.

beeai-framework
BeeAI Framework is a versatile tool for building production-ready multi-agent systems. It offers flexibility in orchestrating agents, seamless integration with various models and tools, and production-grade controls for scaling. The framework supports Python and TypeScript libraries, enabling users to implement simple to complex multi-agent patterns, connect with AI services, and optimize token usage and resource management.

FastGPT
FastGPT is a knowledge base Q&A system based on the LLM large language model, providing out-of-the-box data processing, model calling and other capabilities. At the same time, you can use Flow to visually arrange workflows to achieve complex Q&A scenarios!

deepflow
DeepFlow is an open-source project that provides deep observability for complex cloud-native and AI applications. It offers Zero Code data collection with eBPF for metrics, distributed tracing, request logs, and function profiling. DeepFlow is integrated with SmartEncoding to achieve Full Stack correlation and efficient access to all observability data. With DeepFlow, cloud-native and AI applications automatically gain deep observability, removing the burden of developers continually instrumenting code and providing monitoring and diagnostic capabilities covering everything from code to infrastructure for DevOps/SRE teams.
For similar tasks

tt-metal
TT-NN is a Python & C++ neural network op library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.

mscclpp
MSCCL++ is a GPU-driven communication stack for scalable AI applications, delivering a highly efficient and customizable communication stack for distributed GPU applications. Its design is specifically tailored to accommodate the diverse performance optimization scenarios often encountered in state-of-the-art AI applications. MSCCL++ provides communication abstractions both at the lowest level, close to hardware, and at the highest level, close to the application API. The lowest level of abstraction is ultra-lightweight, enabling a user to implement the data-movement logic of a collective operation such as AllReduce inside a GPU kernel extremely efficiently, without worrying about the memory ordering of different ops. The modularity of MSCCL++ lets a user construct its building blocks at a high level of abstraction in Python and feed them to a CUDA kernel, improving productivity. MSCCL++ provides fine-grained synchronous and asynchronous 0-copy 1-sided abstractions for communication primitives such as `put()`, `get()`, `signal()`, `flush()`, and `wait()`. The 1-sided abstractions allow a user to asynchronously `put()` their data on the remote GPU as soon as it is ready, without requiring the remote side to issue any receive instruction, which makes it easy to implement flexible communication logic such as overlapping communication with computation or building customized collective algorithms without risking deadlocks. The 0-copy capability lets MSCCL++ transfer data directly between user buffers without intermediate internal buffers, saving GPU bandwidth and memory capacity. Finally, MSCCL++ provides consistent abstractions regardless of the location of the remote GPU (on the local node or a remote node) or the underlying link (NVLink/xGMI or InfiniBand), simplifying inter-GPU communication code that is otherwise complex and error-prone due to the memory ordering of GPU/CPU reads and writes.

mlir-air
This repository contains tools and libraries for building AIR platforms, runtimes and compilers.

free-for-life
A massive list of products and services that are completely free, organized into categories such as APIs, data & ML, artificial intelligence, BaaS, code editors, code generation, DNS, databases, design & UI, domains, email, fonts, resources for students, forms, Linux distributions, messaging & streaming, PaaS, payments & billing, and SSL.

AIMr
AIMr is an AI aimbot tool written in Python that leverages modern technologies to achieve an undetected system with a pleasing appearance. It works on any game that uses human-shaped models. To optimize its performance, users should build OpenCV with CUDA. For Valorant, additional perks in the Discord and an Arduino Leonardo R3 are required.

aika
AIKA (Artificial Intelligence for Knowledge Acquisition) is a new type of artificial neural network designed to mimic the behavior of a biological brain more closely and bridge the gap to classical AI. The network conceptually separates activations from neurons, creating two separate graphs to represent acquired knowledge and inferred information. It uses different types of neurons and synapses to propagate activation values, binding signals, causal relations, and training gradients. The network structure allows for flexible topology and supports the gradual population of neurons and synapses during training.

nextpy
Nextpy is a cutting-edge software development framework optimized for AI-based code generation. It provides guardrails for defining AI system boundaries, structured outputs for prompt engineering, a powerful prompt engine for efficient processing, better AI generations with precise output control, modularity for multiplatform and extensible usage, developer-first approach for transferable knowledge, and containerized & scalable deployment options. It offers 4-10x faster performance compared to Streamlit apps, with a focus on cooperation within the open-source community and integration of key components from various projects.
For similar jobs

Qwen-TensorRT-LLM
Qwen-TensorRT-LLM is a project developed for the NVIDIA TensorRT Hackathon 2023, focusing on accelerating inference for the Qwen-7B-Chat model using TRT-LLM. The project offers various functionalities such as FP16/BF16 support, INT8 and INT4 quantization options, Tensor Parallel for multi-GPU parallelism, web demo setup with gradio, Triton API deployment for maximum throughput/concurrency, fastapi integration for openai requests, CLI interaction, and langchain support. It supports models like qwen2, qwen, and qwen-vl for both base and chat models. The project also provides tutorials on Bilibili and blogs for adapting Qwen models in NVIDIA TensorRT-LLM, along with hardware requirements and quick start guides for different model types and quantization methods.

dl_model_infer
This project is a C++ AI inference library that supports inference of TensorRT models. It provides accelerated deployment cases for popular deep learning CV models and supports dynamic-batch image processing, inference, decode, and NMS. The project has been updated with various models and provides tutorials for model export. It also includes a producer-consumer inference model for specific tasks. The project directory includes implementations for model inference applications, backend inference classes, post-processing, pre-processing, and target detection and tracking. Speed tests have been conducted on various models, and ONNX downloads are available for different models.

joliGEN
JoliGEN is an integrated framework for training custom generative AI image-to-image models. It implements GAN, Diffusion, and Consistency models for various image translation tasks, including domain and style adaptation with conservation of semantics. The tool is designed for real-world applications such as Controlled Image Generation, Augmented Reality, Dataset Smart Augmentation, and Synthetic to Real transforms. JoliGEN allows for fast and stable training with a REST API server for simplified deployment. It offers a wide range of options and parameters with detailed documentation available for models, dataset formats, and data augmentation.

ai-edge-torch
AI Edge Torch is a Python library that supports converting PyTorch models into a .tflite format for on-device applications on Android, iOS, and IoT devices. It offers broad CPU coverage with initial GPU and NPU support, closely integrating with PyTorch and providing good coverage of Core ATen operators. The library includes a PyTorch converter for model conversion and a Generative API for authoring mobile-optimized PyTorch Transformer models, enabling easy deployment of Large Language Models (LLMs) on mobile devices.

awesome-RK3588
RK3588 is a flagship 8K SoC chip by Rockchip, integrating Cortex-A76 and Cortex-A55 cores with NEON coprocessor for 8K video codec. This repository curates resources for developing with RK3588, including official resources, RKNN models, projects, development boards, documentation, tools, and sample code.

cl-waffe2
cl-waffe2 is an experimental deep learning framework in Common Lisp, providing fast, systematic, and customizable matrix operations, reverse mode tape-based Automatic Differentiation, and neural network model building and training features accelerated by a JIT Compiler. It offers abstraction layers, extensibility, inlining, graph-level optimization, visualization, debugging, systematic nodes, and symbolic differentiation. Users can easily write extensions and optimize their networks without overheads. The framework is designed to eliminate barriers between users and developers, allowing for easy customization and extension.

TensorRT-Model-Optimizer
The NVIDIA TensorRT Model Optimizer is a library designed to quantize and compress deep learning models for optimized inference on GPUs. It offers state-of-the-art model optimization techniques including quantization and sparsity to reduce inference costs for generative AI models. Users can easily stack different optimization techniques to produce quantized checkpoints from torch or ONNX models. The quantized checkpoints are ready for deployment in inference frameworks like TensorRT-LLM or TensorRT, with planned integrations for NVIDIA NeMo and Megatron-LM. The tool also supports 8-bit quantization with Stable Diffusion for enterprise users on NVIDIA NIM. Model Optimizer is available for free on NVIDIA PyPI, and this repository serves as a platform for sharing examples, GPU-optimized recipes, and collecting community feedback.

depthai
This repository contains a demo application for DepthAI, a tool that can load different networks, create pipelines, record video, and more. It provides documentation for installation and usage, including running programs through Docker. Users can explore DepthAI features via command line arguments or a clickable QT interface. Supported models include various AI models for tasks like face detection, human pose estimation, and object detection. The tool collects anonymous usage statistics by default, which can be disabled. Users can report issues to the development team for support and troubleshooting.