llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
Stars: 300
llm-analysis is a tool designed for Latency and Memory Analysis of Transformer Models for Training and Inference. It automates the calculation of training or inference latency and memory usage for Large Language Models (LLMs) or Transformers based on specified model, GPU, data type, and parallelism configurations. The tool helps users to experiment with different setups theoretically, understand system performance, and optimize training/inference scenarios. It supports various parallelism schemes, communication methods, activation recomputation options, data types, and fine-tuning strategies. Users can integrate llm-analysis in their code using the `LLMAnalysis` class or use the provided entry point functions for command line interface. The tool provides lower-bound estimations of memory usage and latency, and aims to assist in achieving feasible and optimal setups for training or inference.
Many formulas or equations are floating around in papers, blogs, etc., about how to calculate training or inference latency and memory for Large Language Models (LLMs) or Transformers. Rather than doing math on paper or typing in Excel sheets, let's automate the boring stuff with llm-analysis ⚙️!
Given the specified model, GPU, data type, and parallelism configurations, llm-analysis estimates the latency and memory usage of LLMs for training or inference. With llm-analysis, one can easily try out different training/inference setups theoretically, and better understand the system performance for different scenarios.
llm-analysis helps answer questions such as:
- what batch size, data type, and parallelism scheme to use to get a `feasible` (not getting OOM) and `optimal` (maximizing throughput with a latency constraint) setup for training or inference
- the `time` it takes with the given setup to do training or inference and the `cost` (GPU-hours)
- how the latency/memory changes if using a different model, GPU type, number of GPUs, data type for weights and activations, or parallelism configuration (suggesting the performance benefit of `modeling change`, `hardware improvement`, `quantization`, `parallelism`, etc.)
Check the example use cases. With llm-analysis, you can do such analysis in minutes 🚀!
- To install llm-analysis from pypi: `pip install llm-analysis`
- To install the latest development build: `pip install --upgrade git+https://github.com/cli99/llm-analysis.git@main`
- To install from source, clone the repo and run `pip install .` or `poetry install` (install poetry by `pip install poetry`).
To integrate llm-analysis in your code, use the `LLMAnalysis` class. Refer to doc LLMAnalysis for details.
`LLMAnalysis` is constructed with flops and memory efficiency numbers and the following configuration classes:
- `ModelConfig` covers model information, i.e. max sequence length, number of transformer layers, number of attention heads, hidden dimension, vocabulary size
- `GPUConfig` covers GPU compute and memory specifications
- `DtypeConfig` covers the number of bits used for the model weight, activation, and embedding
- `ParallelismConfig` covers Tensor Parallelism (`tp`), Pipeline Parallelism (`pp`), Sequence Parallelism (`sp`), Expert Parallelism (`ep`), and Data Parallelism (`dp`).

Then `LLMAnalysis` can be queried with different arguments through the training and inference methods.
llm-analysis provides two entry functions, train and infer, for ease of use through the command line interface. Run `python -m llm_analysis.analysis train --help` or `python -m llm_analysis.analysis infer --help` to check the options or read the linked doc. Refer to the examples to see how they are used.
`train` and `infer` use the pre-defined name-to-configuration mappings (`model_configs`, `gpu_configs`, `dtype_configs`) and other user-input arguments to construct the `LLMAnalysis` and do the query.
The pre-defined mappings are populated at runtime from the model, GPU, and data type configuration `json` files under model_configs, gpu_configs, and dtype_configs. To add a new model, GPU, or data type to the mapping for query, just add a `json` description file to the corresponding folder.
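The folder-to-mapping idea described above can be sketched in a few lines. This is a minimal illustration, not llm-analysis's own loader: the helper `load_configs`, the `"name"` key, and the example fields are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_configs(folder: str) -> dict:
    """Build a name-to-configuration mapping from every *.json file in a folder.

    Hypothetical helper for illustration; each file is assumed to hold one
    JSON object whose "name" field is used as the mapping key.
    """
    configs = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            config = json.load(f)
        configs[config["name"]] = config
    return configs


def demo() -> dict:
    # Write one made-up description file, then populate the mapping from it.
    with tempfile.TemporaryDirectory() as folder:
        example = {"name": "toy-model", "num_layers": 2, "hidden_dim": 64}
        (Path(folder) / "toy-model.json").write_text(json.dumps(example))
        return load_configs(folder)
```

With this layout, adding a new entry only requires dropping another file into the folder; no code change is needed.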
llm-analysis also supports retrieving `ModelConfig` from a model config json file path or from Hugging Face by model name.
- From a local model config json file, e.g., `python -m llm_analysis.analysis train --model_name=local_example_model.json`. Check the model configurations under the model_configs folder.
- From Hugging Face, e.g., use `EleutherAI/gpt-neox-20b` as `model_name` when calling the `train` or `infer` entry functions: `python -m llm_analysis.analysis train --model_name=EleutherAI/gpt-neox-20b --total_num_gpus 32 --ds_zero 3`. With this method, llm-analysis relies on `transformers` to find the corresponding model configuration on huggingface.co/models, meaning that information on newer models is only available from a certain version of the transformers library onward. To access the latest models through their names, update the installed `transformers` package.
A list of handy commands is provided to query against the pre-defined mappings as well as Hugging Face, or to dump configurations. Run `python -m llm_analysis.config --help` for details.
Some examples:

`python -m llm_analysis.config get_model_config_by_name EleutherAI/gpt-neox-20b` gets the `ModelConfig` from the populated mapping by name; if it is not found, llm-analysis tries to get it from Hugging Face.

Note that LLaMA models need at least `transformers-4.28.1` to retrieve. Either update to a later `transformers` library, or use the predefined `ModelConfig` for LLaMA models (`/` in model names is replaced with `_`).

`python -m llm_analysis.config list_gpu_configs` lists the names of all predefined GPU configurations. You can then query with `python -m llm_analysis.config get_gpu_config_by_name a100-sxm-80gb` to show the corresponding `GPUConfig`.
Setting flops and memory efficiency to `1` (default) gives the lower bound of training or inference latency, as it assumes the peak hardware performance (which is never the case).
A close-to-reality flops or memory efficiency can be found by benchmarking and profiling using the input dimensions in the model.
If one has to make assumptions, for flops efficiency, literature reports up to `0.5` for large scale model training, and up to `0.7` for inference; `0.9` can be an aggressive target for memory efficiency.
llm-analysis aims to provide a `lower-bound` estimation of memory usage and latency.
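To see why an efficiency of `1` yields a lower bound, consider a minimal sketch in which compute latency is the amount of work divided by the achieved throughput. The function name and the numbers are assumptions for illustration, not llm-analysis's API.

```python
def compute_latency_s(total_flops: float, peak_flops_per_s: float,
                      flops_efficiency: float = 1.0) -> float:
    """Time to execute total_flops at a fraction of the peak throughput."""
    if not 0.0 < flops_efficiency <= 1.0:
        raise ValueError("flops_efficiency must be in (0, 1]")
    return total_flops / (peak_flops_per_s * flops_efficiency)


# Illustrative numbers (assumptions): 1e15 FLOPs of work on a device
# with a peak of 312e12 FLOP/s.
lower_bound = compute_latency_s(1e15, 312e12)     # efficiency 1: best case
realistic = compute_latency_s(1e15, 312e12, 0.5)  # efficiency 0.5: twice as long
```

Any efficiency below `1` can only increase the estimate, so the default setting is the most optimistic one.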
llm-analysis currently covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), Expert Parallelism (ep), and Data Parallelism (dp).
- tp, pp, and sp adopt the style of parallelization used in Megatron-LM for training and FasterTransformer for inference
- In the training analysis, dp sharding assumes using DeepSpeed ZeRO or FSDP. `ds_zero` is used to specify the dp sharding strategy

ds_zero | DeepSpeed ZeRO | FSDP | Sharding |
---|---|---|---|
0 | disabled | NO_SHARD | No sharding |
1 | Stage 1 | N/A | Shard optimizer states |
2 | Stage 2 | SHARD_GRAD_OP | Shard gradients and optimizer states |
3 | Stage 3 | FULL_SHARD | Shard gradients, optimizer states, model parameters |

- ep parallelizes the number of MLP experts across `ep_size` devices, i.e. the number of experts per GPU is `total number of experts / ep_size`. Thus for the MLP module, the number of devices for other parallelization dimensions is divided by `ep_size` compared to other parts of the model.
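The effect of the `ds_zero` setting on per-GPU memory can be sketched as follows. This is not llm-analysis's implementation: it assumes mixed-precision Adam accounting (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states), ignores activations, and uses a hypothetical function name.

```python
def training_state_bytes_per_gpu(num_params: float, dp_size: int, ds_zero: int,
                                 weight_bytes: int = 2, grad_bytes: int = 2,
                                 optimizer_bytes: int = 12) -> float:
    """Weights + gradients + optimizer states held by one GPU, in bytes.

    Byte counts per parameter are assumptions (mixed-precision Adam);
    activation memory is not included.
    """
    if ds_zero not in (0, 1, 2, 3):
        raise ValueError("ds_zero must be 0, 1, 2, or 3")
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    optim = num_params * optimizer_bytes
    if ds_zero >= 1:  # shard optimizer states across dp ranks
        optim /= dp_size
    if ds_zero >= 2:  # additionally shard gradients
        grads /= dp_size
    if ds_zero >= 3:  # additionally shard model parameters
        weights /= dp_size
    return weights + grads + optim


# Illustrative setup (assumption): 7.5e9 parameters, 64 data-parallel ranks.
per_stage = {stage: training_state_bytes_per_gpu(7.5e9, 64, stage)
             for stage in (0, 1, 2, 3)}
```

Under these assumptions the per-GPU footprint shrinks with each stage, and at stage 3 it is the unsharded total divided by the number of data-parallel ranks.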
tp communication is calculated as using `ring allreduce`. ep communication is calculated as using `alltoall`.
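As a rough illustration of the ring allreduce cost model, each device sends and receives about `2 * (n - 1) / n` times the message size. The sketch below ignores per-hop latency; the function name and the bandwidth figure are assumptions, and llm-analysis may model this differently.

```python
def ring_allreduce_time_s(message_bytes: float, num_devices: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of a ring allreduce over num_devices devices."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if num_devices == 1:
        return 0.0  # nothing to communicate
    # Reduce-scatter plus all-gather: each moves (n - 1) / n of the message.
    volume = 2 * (num_devices - 1) / num_devices * message_bytes
    return volume / bandwidth_bytes_per_s


# Illustrative message (assumption): batch 1, sequence 2048, hidden 4096, 2 bytes each.
activation_bytes = 1 * 2048 * 4096 * 2
tp_comm_s = ring_allreduce_time_s(activation_bytes, 8, 300e9)
```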
dp communication time to unshard model weights when using FSDP or DeepSpeed ZeRO is estimated and compared against the compute latency; the larger of the two is used for the overall latency.
Other dp and pp communications are ignored for now, i.e. assuming perfect computation and communication overlapping, which is not true when communication cannot overlap with compute due to dependency, or when communication is too long to hide due to slow interconnect or large data volume.
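The "larger of the two" rule can be written down directly. In this sketch the unshard step is approximated as an all-gather in which each rank receives `(dp_size - 1) / dp_size` of the full weights; that approximation and the function names are assumptions, not the tool's exact formula.

```python
def allgather_time_s(full_weight_bytes: float, dp_size: int,
                     bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of gathering sharded weights onto one rank."""
    if dp_size <= 1:
        return 0.0  # weights are not sharded, nothing to gather
    return (dp_size - 1) / dp_size * full_weight_bytes / bandwidth_bytes_per_s


def overall_latency_s(compute_latency_s: float, full_weight_bytes: float,
                      dp_size: int, bandwidth_bytes_per_s: float) -> float:
    """Assume perfect overlap: the slower of compute and unshard dominates."""
    unshard_s = allgather_time_s(full_weight_bytes, dp_size, bandwidth_bytes_per_s)
    return max(compute_latency_s, unshard_s)
```

When the unshard time exceeds the compute time, the setup is communication-bound and the estimate reflects the communication time instead of the compute time.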
llm-analysis supports both full and selective activation recomputation.
activation_recomputation | what is checkpointed and recomputed |
---|---|
0 | No activation recomputation; requires the most amount of memory |
1 | Checkpoints the attention computation (QK^T matrix multiply, softmax, softmax dropout, and attention over V.) in the attention module of a transformer layer; as described in Reducing Activation Recomputation in Large Transformer Models. |
2 | Checkpoints the input to the attention module in a transformer layer; requires an extra forward pass on attention. |
3 | Checkpoints the input to the sequence of modules (layernorm-attention-layernorm) in a transformer layer; requires an extra forward pass on (layernorm-attention-layernorm). |
4 | Full activation recomputation stores the input to the transformer layer; requires the least amount of memory; requires an extra forward pass of the entire layer. |
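To get a feel for how much memory the options trade off, the sketch below uses the per-layer activation formulas from Reducing Activation Recomputation in Large Transformer Models for 16-bit activations without tensor or sequence parallelism. It covers only options `0`, `1`, and `4`; the function name is hypothetical and llm-analysis's own accounting may differ in detail.

```python
def activation_bytes_per_layer(seq_len: int, batch_size: int, hidden_dim: int,
                               num_heads: int,
                               activation_recomputation: int = 0) -> float:
    """Approximate activation memory of one transformer layer, in bytes.

    Assumes 16-bit activations and no tensor/sequence parallelism.
    """
    sbh = seq_len * batch_size * hidden_dim
    if activation_recomputation == 0:
        # Everything is stored, including the quadratic attention terms.
        return sbh * (34 + 5 * num_heads * seq_len / hidden_dim)
    if activation_recomputation == 1:
        # Selective: the attention computation is recomputed, not stored.
        return 34 * sbh
    if activation_recomputation == 4:
        # Full: only the layer input is stored.
        return 2 * sbh
    raise ValueError("this sketch only covers options 0, 1, and 4")


# Illustrative shape (assumption): sequence 2048, batch 1, hidden 4096, 32 heads.
sizes = {opt: activation_bytes_per_layer(2048, 1, 4096, 32, opt)
         for opt in (0, 1, 4)}
```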
Data types are expressed with the number of bits; only `32` (FP32, TF32), `16` (FP16, BF16), `8` (INT8), and `4` (INT4) bit data types are modeled for now.
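Expressing data types as bit widths makes weight memory a one-line calculation. The following is a small sketch with a hypothetical helper name, shown only to illustrate the arithmetic.

```python
SUPPORTED_BITS = (32, 16, 8, 4)  # the bit widths listed above


def weight_memory_bytes(num_params: float, weight_bits: int) -> float:
    """Memory needed to hold the model weights at a given bit width."""
    if weight_bits not in SUPPORTED_BITS:
        raise ValueError(f"unsupported bit width: {weight_bits}")
    return num_params * weight_bits / 8


# Illustrative model size (assumption): 20e9 parameters.
fp16_bytes = weight_memory_bytes(20e9, 16)
int4_bytes = weight_memory_bytes(20e9, 4)
```

Going from 16-bit to 4-bit weights cuts the weight footprint by a factor of four in this simple model.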
Fine-tuning is modeled the same (controlled by `total_num_tokens` passed to the `train` entry function) as pre-training, thus assuming full (all model parameters) fine-tuning. Parameter-efficient fine-tuning (PEFT) is planned for future support.
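A back-of-the-envelope version of how a token count turns into a cost estimate is sketched below. It relies on the common approximation of about `6 * parameters * tokens` FLOPs for full training (forward plus backward), which ignores attention FLOPs and activation recomputation; the function name is hypothetical and this is not the formula llm-analysis necessarily uses.

```python
def training_gpu_hours(num_params: float, total_num_tokens: float,
                       peak_flops_per_s: float,
                       flops_efficiency: float = 1.0) -> float:
    """Rough GPU-hours for full (all-parameter) training or fine-tuning."""
    total_flops = 6 * num_params * total_num_tokens  # common approximation
    gpu_seconds = total_flops / (peak_flops_per_s * flops_efficiency)
    return gpu_seconds / 3600


# Illustrative numbers (assumptions): 1e9 parameters, 1e9 tokens, 1e12 FLOP/s peak.
best_case_hours = training_gpu_hours(1e9, 1e9, 1e12)
half_efficiency_hours = training_gpu_hours(1e9, 1e9, 1e12, 0.5)
```

Because the estimate scales linearly with `total_num_tokens`, a fine-tuning run on fewer tokens is simply a shorter pre-training run in this model.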
Inference assumes perfect overlapping of compute and memory operations when calculating latency, and maximum memory reuse when calculating memory usage.
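Perfect overlap of compute and memory operations means the slower of the two determines latency. A minimal sketch for one decoding step, assuming roughly 2 FLOPs per parameter per token, weights read once per step, and no KV cache traffic (all simplifications, with a hypothetical function name):

```python
def decode_step_latency_s(num_params: float, weight_bits: int, batch_size: int,
                          peak_flops_per_s: float,
                          memory_bandwidth_bytes_per_s: float) -> float:
    """Latency of one decoding step when compute and memory fully overlap."""
    compute_s = 2 * num_params * batch_size / peak_flops_per_s
    memory_s = num_params * weight_bits / 8 / memory_bandwidth_bytes_per_s
    return max(compute_s, memory_s)


# Illustrative device (assumption): 1e12 FLOP/s peak, 1e11 bytes/s memory bandwidth.
small_batch_s = decode_step_latency_s(1e9, 16, 1, 1e12, 1e11)    # memory-bound
large_batch_s = decode_step_latency_s(1e9, 16, 100, 1e12, 1e11)  # compute-bound
```

In this sketch, increasing the batch size leaves latency unchanged until the compute time exceeds the time to read the weights.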
TODOs (stay tuned 📻)
Check the TODOs below for what's next and stay tuned 📻! Any contributions or feedback are highly welcome!
- [ ] Add dp (across and within a node), ep (within a node), pp (across nodes) communication analysis
- [ ] Support efficient fine-tuning methods such as LoRA or Adapters
- [ ] Add FP8 datatype support
- [ ] Support CPU offloading (weight, KV cache, etc.) analysis in training and inference
- [ ] Support other hardware (e.g. CPU) for inference analysis
If you use llm-analysis in your work, please cite:
Cheng Li. (2023). LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference. GitHub repository, https://github.com/cli99/llm-analysis.
or
@misc{llm-analysis-chengli,
author = {Cheng Li},
title = {LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cli99/llm-analysis}},
}
Contributions and suggestions are welcome.
llm-analysis uses pre-commit to ensure code formatting is consistent. For pull requests with code contribution, please install pre-commit (`pip install pre-commit`) as well as the used hooks (`pre-commit install` in the repo), and format the code (runs automatically before each git commit) before submitting the PR.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- Reducing Activation Recomputation in Large Transformer Models
- Training Compute-Optimal Large Language Models
- Efficiently Scaling Transformer Inference
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
- A Comprehensive Study on Post-Training Quantization for Large Language Models