llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
Stars: 300
llm-analysis is a tool designed for Latency and Memory Analysis of Transformer Models for Training and Inference. It automates the calculation of training or inference latency and memory usage for Large Language Models (LLMs) or Transformers based on the specified model, GPU, data type, and parallelism configurations. The tool helps users experiment with different setups theoretically, understand system performance, and optimize training/inference scenarios. It supports various parallelism schemes, communication methods, activation recomputation options, data types, and fine-tuning strategies. Users can integrate llm-analysis into their code using the `LLMAnalysis` class or use the provided entry point functions through the command line interface. The tool provides lower-bound estimations of memory usage and latency, and aims to assist in achieving feasible and optimal setups for training or inference.
README:
llm-analysis: Latency and Memory Analysis of Transformer Models for Training and Inference
Many formulas or equations are floating around in papers, blogs, etc., about how to calculate training or inference latency and memory for Large Language Models (LLMs) or Transformers. Rather than doing math on papers or typing in Excel sheets, let's automate the boring stuff with llm-analysis ⚙️!
Given the specified model, GPU, data type, and parallelism configurations, llm-analysis estimates the latency and memory usage of LLMs for training or inference. With llm-analysis, one can easily try out different training/inference setups theoretically, and better understand the system performance for different scenarios.
llm-analysis helps answer questions such as:
- what batch size, data type, and parallelism scheme to use to get a feasible (not getting OOM) and optimal (maximizing throughput with a latency constraint) setup for training or inference
- the time it takes with the given setup to do training or inference, and the cost (GPU-hours)
- how the latency/memory changes if using a different model, GPU type, number of GPUs, data type for weights and activations, or parallelism configuration (suggesting the performance benefit of a modeling change, hardware improvement, quantization, parallelism, etc.)
Check the example use cases. With llm-analysis, you can do such analysis in minutes 🚀!
- To install llm-analysis from PyPI: pip install llm-analysis
- To install the latest development build: pip install --upgrade git+https://github.com/cli99/llm-analysis.git@main
- To install from source, clone the repo and run pip install . or poetry install (install poetry with pip install poetry).
To integrate llm-analysis into your code, use the LLMAnalysis class. Refer to the LLMAnalysis doc for details.
LLMAnalysis is constructed with flops and memory efficiency numbers and the following configuration classes:
- ModelConfig covers model information, i.e. max sequence length, number of transformer layers, number of attention heads, hidden dimension, and vocabulary size
- GPUConfig covers GPU compute and memory specifications
- DtypeConfig covers the number of bits used for the model weights, activations, and embeddings
- ParallelismConfig covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), Expert Parallelism (ep), and Data Parallelism (dp)
Then LLMAnalysis can be queried with different arguments through the training and inference methods.
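A minimal sketch of this flow is shown below for illustration. Only the class name, the config classes above, and the existence of training/inference query methods come from this README; the import paths, helper functions, and keyword arguments are assumptions and should be checked against the LLMAnalysis doc.

```python
# Illustrative sketch only: import paths, helper names, and keyword arguments
# below are assumptions, not the confirmed API; consult the LLMAnalysis doc.
from llm_analysis.analysis import LLMAnalysis
from llm_analysis.config import (  # assumed location of config classes/helpers
    ParallelismConfig,
    get_dtype_config_by_name,
    get_gpu_config_by_name,
    get_model_config_by_name,
)

analysis = LLMAnalysis(
    model_config=get_model_config_by_name("EleutherAI/gpt-neox-20b"),
    gpu_config=get_gpu_config_by_name("a100-sxm-80gb"),
    dtype_config=get_dtype_config_by_name("w16a16e16"),  # dtype config name assumed
    parallelism_config=ParallelismConfig(tp_size=4, pp_size=1, dp_size=8),  # field names assumed
    flops_efficiency=0.5,       # 1 (the default) gives the latency lower bound
    hbm_memory_efficiency=0.9,  # keyword name assumed
)

# Query training and inference estimates; the argument names are illustrative.
train_report = analysis.training(batch_size_per_gpu=2, seq_len=2048)
infer_report = analysis.inference(batch_size_per_gpu=1, seq_len=2048, num_tokens_to_generate=256)
```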
llm-analysis provides two entry functions, train and infer, for ease of use through the command line interface. Run python -m llm_analysis.analysis train --help or python -m llm_analysis.analysis infer --help to check the options, or read the linked doc. Refer to the examples to see how they are used.
train and infer use the pre-defined name-to-configuration mappings (model_configs, gpu_configs, dtype_configs) and other user-input arguments to construct the LLMAnalysis and run the query.
The pre-defined mappings are populated at runtime from the model, GPU, and data type configuration JSON files under model_configs, gpu_configs, and dtype_configs. To add a new model, GPU, or data type to the mappings, just add a JSON description file to the corresponding folder.
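For example, a low-risk way to add a new GPU is to copy an existing JSON description file and edit its values rather than guessing the schema. The sketch below assumes a repo-relative folder layout and a "name" key, both of which should be verified against the files actually present under gpu_configs.

```python
# Sketch: clone an existing GPU description and adjust it. The path layout and
# the "name" key are assumptions; check the files under gpu_configs for the schema.
import json
from pathlib import Path

src = Path("llm_analysis/gpu_configs/a100-sxm-80gb.json")  # assumed location
cfg = json.loads(src.read_text())
cfg["name"] = "my-new-gpu"  # keep the remaining compute/memory fields and edit their values
Path("llm_analysis/gpu_configs/my-new-gpu.json").write_text(json.dumps(cfg, indent=2))
```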
llm-analysis also supports retrieving a ModelConfig from a local model config JSON file path or from Hugging Face by model name:
- From a local model config JSON file, e.g., python -m llm_analysis.analysis train --model_name=local_example_model.json. Check the model configurations under the model_configs folder.
- From Hugging Face, e.g., use EleutherAI/gpt-neox-20b as model_name when calling the train or infer entry functions: python -m llm_analysis.analysis train --model_name=EleutherAI/gpt-neox-20b --total_num_gpus 32 --ds_zero 3. With this method, llm-analysis relies on transformers to find the corresponding model configuration on huggingface.co/models, meaning information about newer models is only available after a certain version of the transformers library. To access the latest models through their names, update the installed transformers package.
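A quick, independent way to check whether the installed transformers version can resolve a given model name (this only tests the transformers side of the lookup described above, not llm-analysis itself):

```python
# If this succeeds, the installed transformers version can fetch the model's
# configuration from huggingface.co/models; if it fails, upgrade transformers.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b")
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)
```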
A list of handy commands is provided to query against the pre-defined mappings as well as Hugging Face, or to dump configurations. Run python -m llm_analysis.config --help for details.
Some examples:
- python -m llm_analysis.config get_model_config_by_name EleutherAI/gpt-neox-20b gets the ModelConfig from the populated mapping by name; if it is not found, llm-analysis tries to get it from Hugging Face. Note that LLaMA models need at least transformers 4.28.1 to be retrieved; either update to a later transformers version, or use the predefined ModelConfig for LLaMA models (/ in model names is replaced with _).
- python -m llm_analysis.config list_gpu_configs lists the names of all predefined GPU configurations; you can then query with python -m llm_analysis.config get_gpu_config_by_name a100-sxm-80gb to show the corresponding GPUConfig.
Setting flops and memory efficiency to 1 (default) gives the lower bound of training or inference latency, as it assumes the peak hardware performance (which is never the case).
A close-to-reality flops or memory efficiency can be found by benchmarking and profiling with the input dimensions used in the model.
If one has to make assumptions: for flops efficiency, the literature reports up to 0.5 for large-scale model training and up to 0.7 for inference; 0.9 can be an aggressive target for memory efficiency.
llm-analysis aims to provide a lower-bound estimation of memory usage and latency.
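As a back-of-the-envelope illustration of how a flops efficiency number scales a compute-bound lower bound (this uses the standard ~6 FLOPs per parameter per token approximation for training, not llm-analysis's internal formulas):

```python
# Rough single-GPU compute-latency estimate; NOT llm-analysis's internal model.
# Training takes roughly 6 FLOPs per parameter per token (forward + backward).
num_params = 20e9        # e.g. a 20B-parameter model
num_tokens = 2048        # tokens in one sequence
peak_flops = 312e12      # A100 dense BF16/FP16 peak, 312 TFLOPS
flops_efficiency = 0.5   # achieved fraction of peak; 1 gives the lower bound

train_flops = 6 * num_params * num_tokens
lower_bound_s = train_flops / peak_flops
expected_s = train_flops / (peak_flops * flops_efficiency)
print(f"lower bound {lower_bound_s:.2f} s, expected {expected_s:.2f} s per sequence")
```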
llm-analysis currently covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), Expert Parallelism (ep), and Data Parallelism (dp).
- tp, pp, and sp adopt the style of parallelization used in Megatron-LM for training and FasterTransformer for inference
- In the training analysis, dp sharding assumes using DeepSpeed ZeRO or FSDP. ds_zero is used to specify the dp sharding strategy (a back-of-the-envelope memory sketch follows this list):

| ds_zero | DeepSpeed ZeRO | FSDP | Sharding |
|---|---|---|---|
| 0 | disabled | NO_SHARD | No sharding |
| 1 | Stage 1 | N/A | Shard optimizer states |
| 2 | Stage 2 | SHARD_GRAD_OP | Shard gradients and optimizer states |
| 3 | Stage 3 | FULL_SHARD | Shard gradients, optimizer states, and model parameters |
- ep parallelizes the number of MLP experts across ep_size devices, i.e. the number of experts per GPU is total number of experts / ep_size (e.g., 64 experts with ep_size=8 gives 8 experts per GPU). Thus, for the MLP module, the number of devices available for the other parallelization dimensions is divided by ep_size compared to other parts of the model.
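To make the dp-sharding row above concrete, here is a rough per-GPU model-state memory sketch in the spirit of the ZeRO paper (16 bytes per parameter for mixed-precision Adam: 2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 master weights plus the Adam moments). This is an illustration of the sharding stages, not llm-analysis's memory model; activation memory is ignored.

```python
# Rough per-GPU model-state memory under the ds_zero stages in the table above,
# following the ZeRO paper's mixed-precision Adam accounting. Illustration only:
# activation memory, temporary buffers, and fragmentation are ignored.
def model_state_gb(num_params: float, dp_size: int, ds_zero: int) -> float:
    weights = 2 * num_params   # 16-bit parameters
    grads = 2 * num_params     # 16-bit gradients
    optim = 12 * num_params    # fp32 master weights + Adam m and v
    if ds_zero >= 1:
        optim /= dp_size       # stage 1: shard optimizer states
    if ds_zero >= 2:
        grads /= dp_size       # stage 2: also shard gradients
    if ds_zero >= 3:
        weights /= dp_size     # stage 3: also shard parameters
    return (weights + grads + optim) / 1e9  # bytes -> GB

for stage in range(4):
    print(f"ds_zero={stage}: {model_state_gb(20e9, dp_size=8, ds_zero=stage):.0f} GB per GPU")
```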
tp communication is modeled as ring allreduce; ep communication is modeled as alltoall.
The dp communication time to unshard model weights when using FSDP or DeepSpeed ZeRO is estimated and compared against the compute latency; the larger of the two is used for the overall latency.
Other dp and pp communications are ignored for now, i.e. perfect computation/communication overlap is assumed, which does not hold when communication cannot overlap with compute due to dependencies, or when communication is too long to hide due to a slow interconnect or a large data volume.
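The "take the larger of compute and dp-unshard communication" choice can be illustrated with the standard ring-collective cost model (a ring allgather moves (n-1)/n of the full tensor through each link). The numbers and the formula below are illustrative, not the tool's exact code:

```python
# Illustration of comparing the dp unshard (allgather) time against compute and
# keeping the larger of the two. Standard ring cost model; not llm-analysis's code.
def ring_allgather_s(total_bytes: float, n: int, bandwidth_bytes_per_s: float) -> float:
    # Each GPU receives (n-1)/n of the full tensor over the ring.
    return (n - 1) / n * total_bytes / bandwidth_bytes_per_s

weight_bytes = 2 * 20e9                 # 20B parameters in 16-bit
comm_s = ring_allgather_s(weight_bytes, n=8, bandwidth_bytes_per_s=300e9)
compute_s = 0.8                         # whatever the compute analysis estimated
overall_s = max(compute_s, comm_s)      # perfect-overlap assumption: the slower one dominates
print(f"compute {compute_s:.2f} s, unshard comm {comm_s:.2f} s, overall {overall_s:.2f} s")
```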
llm-analysis supports both full and selective activation recomputation.
| activation_recomputation | what is checkpointed and recomputed |
|---|---|
| 0 | No activation recomputation; requires the most amount of memory |
| 1 | Checkpoints the attention computation (QK^T matrix multiply, softmax, softmax dropout, and attention over V) in the attention module of a transformer layer, as described in Reducing Activation Recomputation in Large Transformer Models. |
| 2 | Checkpoints the input to the attention module in a transformer layer; requires an extra forward pass on attention. |
| 3 | Checkpoints the input to the sequence of modules (layernorm-attention-layernorm) in a transformer layer; requires an extra forward pass on (layernorm-attention-layernorm). |
| 4 | Full activation recomputation stores the input to the transformer layer; requires the least amount of memory; requires an extra forward pass of the entire layer. |
Data types are expressed with the number of bits; only 32-bit (FP32, TF32), 16-bit (FP16, BF16), 8-bit (INT8), and 4-bit (INT4) data types are modeled for now.
Fine-tuning is modeled the same way as pre-training (controlled by total_num_tokens passed to the train entry function), thus assuming full fine-tuning of all model parameters. Parameter-efficient fine-tuning (PEFT) support is planned (see the TODOs below).
Inference assumes perfect overlapping of compute and memory operations when calculating latency, and maximum memory reuse when calculating memory usage.
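That perfect-overlap assumption amounts to a roofline-style per-step estimate: latency is bounded by the larger of the compute time and the time to stream the weights (and KV cache) through GPU memory. A sketch with illustrative numbers, not the tool's internal code:

```python
# Roofline-style illustration of the perfect compute/memory overlap assumption
# for a single batch-1 decoding step; formula and numbers are illustrative only.
num_params = 20e9
bytes_per_weight = 2          # 16-bit weights
peak_flops = 312e12           # A100 dense BF16/FP16 peak
hbm_bandwidth = 2.0e12        # ~2 TB/s on A100-80GB

compute_s = 2 * num_params / peak_flops                    # ~2 FLOPs per parameter per token
memory_s = num_params * bytes_per_weight / hbm_bandwidth   # weights read once per token
step_s = max(compute_s, memory_s)                          # the slower path dominates
print(f"compute {compute_s*1e3:.2f} ms, memory {memory_s*1e3:.2f} ms, step {step_s*1e3:.2f} ms")
```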
TODOs (stay tuned 📻)
Check the TODOs below for what's next and stay tuned 📻! Any contributions or feedback are highly welcome!
- [ ] Add dp (across and within a node), ep (within a node), pp (across nodes) communication analysis
- [ ] Support efficient fine-tuning methods such as LoRA or Adapters
- [ ] Add FP8 datatype support
- [ ] Support CPU offloading (weight, KV cache, etc.) analysis in training and inference
- [ ] Support other hardware (e.g. CPU) for inference analysis
If you use llm-analysis in your work, please cite:
Cheng Li. (2023). LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference. GitHub repository, https://github.com/cli99/llm-analysis.
or
@misc{llm-analysis-chengli,
author = {Cheng Li},
title = {LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cli99/llm-analysis}},
}
Contributions and suggestions are welcome.
llm-analysis uses pre-commit to ensure code formatting is consistent. For pull requests with code contributions, please install pre-commit (pip install pre-commit) as well as the hooks it uses (run pre-commit install in the repo), and format the code (this runs automatically before each git commit) before submitting the PR.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
- Reducing Activation Recomputation in Large Transformer Models
- Training Compute-Optimal Large Language Models
- Efficiently Scaling Transformer Inference
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
- A Comprehensive Study on Post-Training Quantization for Large Language Models
Alternative AI tools for llm-analysis
Similar Open Source Tools
yalm
Yalm (Yet Another Language Model) is an LLM inference implementation in C++/CUDA, emphasizing performance engineering, documentation, scientific optimizations, and readability. It is not for production use and has been tested on Mistral-v0.2 and Llama-3.2. Requires C++20-compatible compiler, CUDA toolkit, and LLM safetensor weights in huggingface format converted to .yalm file.
gepa
GEPA (Genetic-Pareto) is a framework for optimizing arbitrary systems composed of text components like AI prompts, code snippets, or textual specs against any evaluation metric. It employs LLMs to reflect on system behavior, using feedback from execution and evaluation traces to drive targeted improvements. Through iterative mutation, reflection, and Pareto-aware candidate selection, GEPA evolves robust, high-performing variants with minimal evaluations, co-evolving multiple components in modular systems for domain-specific gains. The repository provides the official implementation of the GEPA algorithm as proposed in the paper titled 'GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning'.
matsciml
The Open MatSci ML Toolkit is a flexible framework for machine learning in materials science. It provides a unified interface to a variety of materials science datasets, as well as a set of tools for data preprocessing, model training, and evaluation. The toolkit is designed to be easy to use for both beginners and experienced researchers, and it can be used to train models for a wide range of tasks, including property prediction, materials discovery, and materials design.
LLMeBench
LLMeBench is a flexible framework designed for accelerating benchmarking of Large Language Models (LLMs) in the field of Natural Language Processing (NLP). It supports evaluation of various NLP tasks using model providers like OpenAI, HuggingFace Inference API, and Petals. The framework is customizable for different NLP tasks, LLM models, and datasets across multiple languages. It features extensive caching capabilities, supports zero- and few-shot learning paradigms, and allows on-the-fly dataset download and caching. LLMeBench is open-source and continuously expanding to support new models accessible through APIs.
Trace
Trace is a new AutoDiff-like tool for training AI systems end-to-end with general feedback. It generalizes the back-propagation algorithm by capturing and propagating an AI system's execution trace. Implemented as a PyTorch-like Python library, users can write Python code directly and use Trace primitives to optimize certain parts, similar to training neural networks.
guidellm
GuideLLM is a platform for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM enables users to assess the performance, resource requirements, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality. The tool provides features for performance evaluation, resource optimization, cost estimation, and scalability testing.
codellm-devkit
Codellm-devkit (CLDK) is a Python library that serves as a multilingual program analysis framework bridging traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). It simplifies the process of analyzing codebases across multiple programming languages, enabling the extraction of meaningful insights and facilitating LLM-based code analysis. The library provides a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs. Codellm-devkit aims to enable the development and experimentation of robust analysis pipelines that combine traditional program analysis tools and CodeLLMs, reducing friction in multi-language code analysis and ensuring compatibility across different tools and LLM platforms. It is designed to seamlessly integrate with popular analysis tools like WALA, Tree-sitter, LLVM, and CodeQL, acting as a crucial intermediary layer for efficient communication between these tools and CodeLLMs. The project is continuously evolving to include new tools and frameworks, maintaining its versatility for code analysis and LLM integration.
TriForce
TriForce is a training-free tool designed to accelerate long sequence generation. It supports long-context Llama models and offers both on-chip and offloading capabilities. Users can achieve a 2.2x speedup on a single A100 GPU. TriForce also provides options for offloading with tensor parallelism or without it, catering to different hardware configurations. The tool includes a baseline for comparison and is optimized for performance on RTX 4090 GPUs. Users can cite the associated paper if they find TriForce useful for their projects.
aimo-progress-prize
This repository contains the training and inference code needed to replicate the winning solution to the AI Mathematical Olympiad - Progress Prize 1. It consists of fine-tuning DeepSeekMath-Base 7B, high-quality training datasets, a self-consistency decoding algorithm, and carefully chosen validation sets. The training methodology involves Chain of Thought (CoT) and Tool Integrated Reasoning (TIR) training stages. Two datasets, NuminaMath-CoT and NuminaMath-TIR, were used to fine-tune the models. The models were trained using open-source libraries like TRL, PyTorch, vLLM, and DeepSpeed. Post-training quantization to 8-bit precision was done to improve performance on Kaggle's T4 GPUs. The project structure includes scripts for training, quantization, and inference, along with necessary installation instructions and hardware/software specifications.
fuse-med-ml
FuseMedML is a Python framework designed to accelerate machine learning-based discovery in the medical field by promoting code reuse. It provides a flexible design concept where data is stored in a nested dictionary, allowing easy handling of multi-modality information. The framework includes components for creating custom models, loss functions, metrics, and data processing operators. Additionally, FuseMedML offers 'batteries included' key components such as fuse.data for data processing, fuse.eval for model evaluation, and fuse.dl for reusable deep learning components. It supports PyTorch and PyTorch Lightning libraries and encourages the creation of domain extensions for specific medical domains.
semlib
Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs). It provides functional programming primitives like map, reduce, sort, and filter, programmed with natural language descriptions. Semlib handles complexities such as prompting, parsing, concurrency control, caching, and cost tracking. The library breaks down sophisticated data processing tasks into simpler steps to improve quality, feasibility, latency, cost, security, and flexibility of data processing tasks.
Pixel-Reasoner
Pixel Reasoner is a framework that introduces reasoning in the pixel-space for Vision-Language Models (VLMs), enabling them to directly inspect, interrogate, and infer from visual evidences. This enhances reasoning fidelity for visual tasks by equipping VLMs with visual reasoning operations like zoom-in and select-frame. The framework addresses challenges like model's imbalanced competence and reluctance to adopt pixel-space operations through a two-phase training approach involving instruction tuning and curiosity-driven reinforcement learning. With these visual operations, VLMs can interact with complex visual inputs such as images or videos to gather necessary information, leading to improved performance across visual reasoning benchmarks.
RAG-FiT
RAG-FiT is a library designed to improve Language Models' ability to use external information by fine-tuning models on specially created RAG-augmented datasets. The library assists in creating training data, training models using parameter-efficient finetuning (PEFT), and evaluating performance using RAG-specific metrics. It is modular, customizable via configuration files, and facilitates fast prototyping and experimentation with various RAG settings and configurations.
cuvs
cuVS is a library that contains state-of-the-art implementations of several algorithms for running approximate nearest neighbors and clustering on the GPU. It can be used directly or through the various databases and other libraries that have integrated it. The primary goal of cuVS is to simplify the use of GPUs for vector similarity search and clustering.
aigverse
aigverse is a Python infrastructure framework that bridges the gap between logic synthesis and AI/ML applications. It allows efficient representation and manipulation of logic circuits, making it easier to integrate logic synthesis and optimization tasks into machine learning pipelines. Built upon EPFL Logic Synthesis Libraries, particularly mockturtle, aigverse provides a high-level Python interface to state-of-the-art algorithms for And-Inverter Graph (AIG) manipulation and logic synthesis, widely used in formal verification, hardware design, and optimization tasks.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models, from images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: it is self-contained, with no need for a DBMS or cloud service; it exposes an OpenAPI interface that is easy to integrate with existing infrastructure (e.g. a Cloud IDE); and it supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.