
NVIDIA TensorRT Model Optimizer

A Library to Quantize and Compress Deep Learning Models for Optimized Inference on GPUs


Examples | Benchmark Results | Documentation

Model Optimizer Overview

Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques including quantization, sparsity, distillation, and pruning to compress models. It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like TensorRT-LLM or TensorRT. Further integrations are planned for NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.

Model Optimizer is available for free for all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
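
To make the workflow concrete, the sketch below quantizes a Hugging Face model with post-training quantization and exports a TensorRT-LLM-ready checkpoint. It is a minimal illustration rather than a full recipe: the model name, calibration prompts, and export arguments are placeholder assumptions, and the APIs shown (modelopt.torch.quantization and modelopt.torch.export) should be checked against the documentation for your installed version.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A tiny calibration set for illustration; real recipes use a few hundred representative samples.
calib_prompts = ["TensorRT Model Optimizer compresses models for fast inference."]
calib_batches = [tokenizer(p, return_tensors="pt").input_ids.cuda() for p in calib_prompts]

def forward_loop(model):
    # Run calibration data through the model so activation statistics can be collected.
    for input_ids in calib_batches:
        model(input_ids)

# Post-training quantization to FP8; other configs (e.g. INT8 SmoothQuant, INT4 AWQ) plug in the same way.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a quantized checkpoint that TensorRT-LLM can build an engine from.
export_tensorrt_llm_checkpoint(model, decoder_type="llama", dtype=torch.float16, export_dir="llama2_fp8_ckpt")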

Installation

pip install "nvidia-modelopt[all]~=0.17.0" --extra-index-url https://pypi.nvidia.com

See the installation guide for more fine-grained control over the installation.

Make sure to also install example-specific dependencies from their respective requirements.txt files if any.

Docker

After installing the NVIDIA Container Toolkit, run the following commands to build the Model Optimizer example Docker container, which has all the necessary dependencies pre-installed for running the examples.

# Build the docker
docker/build.sh

# Obtain and start the basic docker image environment.
# The default built docker image is docker.io/library/modelopt_examples:latest
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash

# Check installation
python -c "import modelopt"

NOTE: Unless specified otherwise, all example READMEs assume they are using the ModelOpt docker image for running the examples.

Alternatively, for PyTorch you can also use the NVIDIA NGC PyTorch container, which comes with Model Optimizer pre-installed starting from the 24.06 container. Make sure to update Model Optimizer to the latest version if it is not already.

Techniques

Quantization

Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, INT4, etc., and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
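
As a rough sketch of how this looks in code, continuing from the overview example above (reusing its model and forward_loop; train_dataloader is a placeholder for user fine-tuning data, and the config names shown are assumptions to verify against the docs), PTQ is a single quantize call and QAT is simply continued training of the quantized model:

import torch
import modelopt.torch.quantization as mtq

# PTQ: pick a quantization config and calibrate with a small forward loop.
# Config names such as INT8_SMOOTHQUANT_CFG / INT4_AWQ_CFG / FP8_DEFAULT_CFG select
# the format and algorithm; see the documentation for the full list in your version.
model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)

# QAT: the inserted fake-quantization ops are trainable end to end, so the quantized
# model can be fine-tuned with an ordinary training loop to recover accuracy.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for batch in train_dataloader:  # placeholder data yielding input_ids/labels
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()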

Sparsity

Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides Python APIs to apply weight sparsity to a given model. It supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT.
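
A minimal sketch of the sparsity API, assuming the modelopt.torch.sparsity module and the mode names shown here (magnitude-based 2:4 sparsity needs no calibration data, while SparseGPT takes a data loader); the exact config keys may differ between releases, so treat this as illustrative:

import modelopt.torch.sparsity as mts

# Apply 2:4 structured weight sparsity with magnitude-based pruning (ASP-style).
model = mts.sparsify(model, mode="sparse_magnitude")

# Alternatively, data-driven sparsification with SparseGPT; calib_dataloader is a
# placeholder for a small calibration data loader.
model = mts.sparsify(
    model,
    mode="sparsegpt",
    config={"data_loader": calib_dataloader, "collect_func": lambda batch: batch},
)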

Pruning

Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, and depth.
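
A rough sketch of pruning with the Python API, assuming the modelopt.torch.prune interface, the fastnas mode, and the constraint/config keys shown here (all of which should be verified against the documentation); the calibration data loader and score function are placeholders:

import torch
import torchvision
import modelopt.torch.prune as mtp

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").cuda()
dummy_input = torch.randn(1, 3, 224, 224).cuda()

# Search for a pruned subnetwork that meets a FLOPs budget; score_func should return a
# validation metric so the search can rank candidate subnets (placeholder here).
pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "60%"},
    dummy_input=dummy_input,
    config={"data_loader": calib_dataloader, "score_func": validation_score},
)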

Distillation

Knowledge Distillation can increase the accuracy and/or convergence speed of a desired model architecture by using a more powerful teacher model's learned features to guide the student model's objective function toward imitating them.
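
The distillation API wraps a student together with a teacher and a distillation loss. The sketch below assumes the modelopt.torch.distill interface and names shown (LogitsDistillationLoss, convert with a kd_loss mode, compute_kd_loss), which should be verified against the documentation; teacher_model, student_model, and train_dataloader are placeholders:

import torch
import modelopt.torch.distill as mtd

# Wrap the student so each forward pass also runs the teacher and records both outputs.
kd_config = {
    "teacher_model": teacher_model,  # placeholder: the larger, more accurate model
    "criterion": mtd.LogitsDistillationLoss(),
}
distill_model = mtd.convert(student_model, mode=[("kd_loss", kd_config)])

# Ordinary fine-tuning loop, with the distillation loss computed from the recorded
# teacher/student logits.
optimizer = torch.optim.AdamW(distill_model.parameters(), lr=1e-5)
for batch in train_dataloader:  # placeholder training data
    distill_model(**batch)
    loss = distill_model.compute_kd_loss()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()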

Examples

  • PTQ for LLMs covers how to use Post-training quantization (PTQ) and export to TensorRT-LLM for deployment for popular pre-trained models from frameworks like Hugging Face, NVIDIA NeMo, and Megatron-LM.
  • PTQ for Diffusers walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with TensorRT. The Diffusers example in this repo is complementary to the demoDiffusion example in TensorRT repo and includes FP8 plugins as well as the latest updates on INT8 quantization.
  • QAT for LLMs demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or 4-bit on the NVIDIA Blackwell platform).
  • Sparsity for LLMs shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
  • Pruning demonstrates how to optimally prune Linear and Conv layers, and Transformer attention heads, MLP, and depth using Model Optimizer for the following frameworks:
    • NVIDIA NeMo / NVIDIA Megatron-LM GPT-style models (e.g. Llama 3, Mistral NeMo, etc.)
    • Hugging Face language models like BERT and GPT-J
    • Computer Vision models like NVIDIA Tao framework detection models.
  • ONNX PTQ shows how to quantize ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
  • Distillation for LLMs demonstrates how to use Knowledge Distillation, which can increase the accuracy and/or convergence speed for fine-tuning / QAT.
  • Chained Optimizations shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).

Support Matrix

Benchmark

Please find the benchmarks here.

Release Notes

Please see Model Optimizer Changelog here.
