clearml-fractional-gpu
ClearML Fractional GPU - Run multiple containers on the same GPU with driver level memory limitation β¨ and compute time-slicing
Stars: 56
ClearML Fractional GPU is a tool designed to optimize GPU resource utilization by allowing multiple containers to run on the same GPU with driver-level memory limitation and compute time-slicing. It supports CUDA 11.x & CUDA 12.x, preventing greedy processes from grabbing the entire GPU memory. The tool offers options like Dynamic GPU Slicing, Container-based Memory Limits, and Kubernetes-based Static MIG Slicing to enhance hardware utilization and workload performance for AI development.
README:
Run multiple containers on the same GPU with driver level memory limitation β¨ and compute time-slicing π
π Leave a star to support the project! π
Sharing high-end GPUs or even prosumer & consumer GPUs between multiple users is the most cost-effective
way to accelerate AI development. Unfortunately, until now the
only existing solution applied for MIG/Slicing high-end GPUs (A100+) and required Kubernetes,
π₯ π Welcome To Container Based Fractional GPU For Any Nvidia Card! π π₯
We present pre-packaged containers supporting CUDA 11.x & CUDA 12.x with pre-built hard memory limitation! This means multiple containers can be launched on the same GPU, ensuring one user cannot allocate the entire host GPU memory! (No more greedy processes grabbing the entire GPU memory! Finally we have a driver level hard limiting memory option).
ClearML offers several options to optimize GPU resource utilization by partitioning GPUs:
- Dynamic GPU Slicing: On-demand GPU slicing per task for both MIG and non-MIG devices (available under the ClearML Enterprise plan):
- Container-based Memory Limits (this repository): Use pre-packaged containers with built-in memory limits to run multiple containers on the same GPU (available as part of the ClearML open source offering).
- Kubernetes-based Static MIG Slicing: Set up Kubernetes support for NVIDIA MIG (Multi-Instance GPU) to define GPU fractions for specific workloads (available as part of the ClearML open source offering).
With these options, ClearML enables running AI workloads with optimized hardware utilization and workload performance. This repository covers container-based fractional GPUs. For more information on ClearML's fractional GPU offerings, see the ClearML documentation.
Pick the container that works for you and launch it:
docker run -it --gpus 0 --ipc=host --pid=host clearml/fractional-gpu:u22-cu12.3-8gb bash
To verify fraction GPU memory limit is working correctly, run inside the container:
nvidia-smi
Here is an example output from A100 GPU:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:01:00.0 Off | N/A |
| 32% 33C P0 66W / 250W | 0MiB / 8128MiB | 3% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
Memory Limit | CUDA Ver | Ubuntu Ver | Docker Image |
---|---|---|---|
12 GiB | 12.3 | 22.04 | clearml/fractional-gpu:u22-cu12.3-12gb |
12 GiB | 12.3 | 20.04 | clearml/fractional-gpu:u20-cu12.3-12gb |
12 GiB | 11.7 | 22.04 | clearml/fractional-gpu:u22-cu11.7-12gb |
12 GiB | 11.1 | 20.04 | clearml/fractional-gpu:u20-cu11.1-12gb |
8 GiB | 12.3 | 22.04 | clearml/fractional-gpu:u22-cu12.3-8gb |
8 GiB | 12.3 | 20.04 | clearml/fractional-gpu:u20-cu12.3-8gb |
8 GiB | 11.7 | 22.04 | clearml/fractional-gpu:u22-cu11.7-8gb |
8 GiB | 11.1 | 20.04 | clearml/fractional-gpu:u20-cu11.1-8gb |
4 GiB | 12.3 | 22.04 | clearml/fractional-gpu:u22-cu12.3-4gb |
4 GiB | 12.3 | 20.04 | clearml/fractional-gpu:u20-cu12.3-4gb |
4 GiB | 11.7 | 22.04 | clearml/fractional-gpu:u22-cu11.7-4gb |
4 GiB | 11.1 | 20.04 | clearml/fractional-gpu:u20-cu11.1-4gb |
2 GiB | 12.3 | 22.04 | clearml/fractional-gpu:u22-cu12.3-2gb |
2 GiB | 12.3 | 20.04 | clearml/fractional-gpu:u20-cu12.3-2gb |
2 GiB | 11.7 | 22.04 | clearml/fractional-gpu:u22-cu11.7-2gb |
2 GiB | 11.1 | 20.04 | clearml/fractional-gpu:u20-cu11.1-2gb |
[!IMPORTANT]
You must execute the container with
--pid=host
!
[!NOTE]
--pid=host
is required to allow the driver to differentiate between the container's processes and other host processes when limiting memory / utilization usage
[!TIP]
ClearML-Agent users add
[--pid=host]
to youragent.extra_docker_arguments
section in your config file
Build your own containers and inherit form the original containers.
You can find a few examples here.
Fractional GPU containers can be used on bare-metal executions as well as Kubernetes PODs. Yes! By using one of the Fractional GPU containers you can limit the memory consumption of your Job/Pod and easily share GPUs without fearing they will memory crash one another!
Here's a simple Kubernetes POD template:
apiVersion: v1
kind: Pod
metadata:
name: train-pod
labels:
app: trainme
spec:
hostPID: true
containers:
- name: train-container
image: clearml/fractional-gpu:u22-cu12.3-8gb
command: ['python3', '-c', 'print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
[!IMPORTANT]
You must execute the pod with
hostPID: true
!
[!NOTE]
hostPID: true
is required to allow the driver to differentiate between the pod's processes and other host processes when limiting memory / utilization usage
The containers support Nvidia drivers <= 545.x.x
.
We will keep updating & supporting new drivers as they continue to be released
Supported GPUs: RTX series 10, 20, 30, 40, A series, and Data-Center P100, A100, A10/A40, L40/s, H100
Limitations: Windows Host machines are currently not supported. If this is important for you, leave a request in the Issues section
-
Q: Will running
nvidia-smi
inside the container report the local processes GPU consumption?
A: Yes,nvidia-smi
is communicating directly with the low-level drivers and reports both accurate container GPU memory as well as the container local memory limitation.
Notice GPU utilization will be the global (i.e. host side) GPU utilization and not the specific local container GPU utilization. -
Q: How do I make sure my Python / Pytorch / Tensorflow are actually memory limited?
A: For PyTorch you can run:
import torch
print(f'Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}')
Numba example:
from numba import cuda
print(f'Free GPU Memory: {cuda.current_context().get_memory_info()}')
-
Q: Can the limitation be broken by a user?
A: We are sure a malicious user will find a way. It was never our intention to protect against malicious users.
If you have a malicious user with access to your machines, fractional GPUs are not your number 1 problem π -
Q: How can I programmatically detect the memory limitation?
A: You can check the OS environment variableGPU_MEM_LIMIT_GB
.
Notice that changing it will not remove or reduce the limitation. -
Q: Is running the container with
--pid=host
secure / safe?
A: It should be both secure and safe. The main caveat from a security perspective is that a container process can see any command line running on the host system. If a process command line contains a "secret" then yes, this might become a potential data leak. Notice that passing "secrets" in the command line is ill-advised, and hence we do not consider it a security risk. That said if security is key, the enterprise edition (see below) eliminate the need to run withpid-host
and thus fully secure. -
Q: Can you run the container without
--pid=host
?
A: You can! But you will have to use the enterprise version of the clearml-fractional-gpu container (otherwise the memory limit is applied system wide instead of container wide). If this feature is important for you, please contact ClearML sales & support.
The license to use ClearML is granted for research or development purposes only. ClearML may be used for educational, personal, or internal commercial use.
An expanded Commercial license for use within a product or service is available as part of the ClearML Scale or Enterprise solution.
ClearML offers enterprise and commercial license adding many additional features on top of fractional GPUs, these include orchestration, priority queues, quota management, compute cluster dashboard, dataset management & experiment management, as well as enterprise grade security and support. Learn more about ClearML Orchestration or talk to us directly at ClearML sales.
Tell everyone about it! #ClearMLFractionalGPU
Join our Slack Channel
Tell us when things are not working, and help us debug it on the Issues Page
This product is brought to you by the ClearML team with β€οΈ
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for clearml-fractional-gpu
Similar Open Source Tools
clearml-fractional-gpu
ClearML Fractional GPU is a tool designed to optimize GPU resource utilization by allowing multiple containers to run on the same GPU with driver-level memory limitation and compute time-slicing. It supports CUDA 11.x & CUDA 12.x, preventing greedy processes from grabbing the entire GPU memory. The tool offers options like Dynamic GPU Slicing, Container-based Memory Limits, and Kubernetes-based Static MIG Slicing to enhance hardware utilization and workload performance for AI development.
dora
Dataflow-oriented robotic application (dora-rs) is a framework that makes creation of robotic applications fast and simple. Building a robotic application can be summed up as bringing together hardwares, algorithms, and AI models, and make them communicate with each others. At dora-rs, we try to: make integration of hardware and software easy by supporting Python, C, C++, and also ROS2. make communication low latency by using zero-copy Arrow messages. dora-rs is still experimental and you might experience bugs, but we're working very hard to make it stable as possible.
TPI-LLM
TPI-LLM (Tensor Parallelism Inference for Large Language Models) is a system designed to bring LLM functions to low-resource edge devices, addressing privacy concerns by enabling LLM inference on edge devices with limited resources. It leverages multiple edge devices for inference through tensor parallelism and a sliding window memory scheduler to minimize memory usage. TPI-LLM demonstrates significant improvements in TTFT and token latency compared to other models, and plans to support infinitely large models with low token latency in the future.
TinyLLM
TinyLLM is a project that helps build a small locally hosted language model with a web interface using consumer-grade hardware. It supports multiple language models, builds a local OpenAI API web service, and serves a Chatbot web interface with customizable prompts. The project requires specific hardware and software configurations for optimal performance. Users can run a local language model using inference servers like vLLM, llama-cpp-python, and Ollama. The Chatbot feature allows users to interact with the language model through a web-based interface, supporting features like summarizing websites, displaying news headlines, stock prices, weather conditions, and using vector databases for queries.
StableToolBench
StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features like Virtual API System, Solvable Queries, and Stable Evaluation System. The benchmark ensures consistency through a caching system and API simulators, filters queries based on solvability using LLMs, and evaluates model performance using GPT-4 with metrics like Solvable Pass Rate and Solvable Win Rate.
StableToolBench
StableToolBench is a new benchmark developed to address the instability of Tool Learning benchmarks. It aims to balance stability and reality by introducing features such as a Virtual API System with caching and API simulators, a new set of solvable queries determined by LLMs, and a Stable Evaluation System using GPT-4. The Virtual API Server can be set up either by building from source or using a prebuilt Docker image. Users can test the server using provided scripts and evaluate models with Solvable Pass Rate and Solvable Win Rate metrics. The tool also includes model experiments results comparing different models' performance.
SimpleAICV_pytorch_training_examples
SimpleAICV_pytorch_training_examples is a repository that provides simple training and testing examples for various computer vision tasks such as image classification, object detection, semantic segmentation, instance segmentation, knowledge distillation, contrastive learning, masked image modeling, OCR text detection, OCR text recognition, human matting, salient object detection, interactive segmentation, image inpainting, and diffusion model tasks. The repository includes support for multiple datasets and networks, along with instructions on how to prepare datasets, train and test models, and use gradio demos. It also offers pretrained models and experiment records for download from huggingface or Baidu-Netdisk. The repository requires specific environments and package installations to run effectively.
Liger-Kernel
Liger Kernel is a collection of Triton kernels designed for LLM training, increasing training throughput by 20% and reducing memory usage by 60%. It includes Hugging Face Compatible modules like RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy. The tool works with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed, aiming to enhance model efficiency and performance for researchers, ML practitioners, and curious novices.
MooER
MooER (ζ©θ³) is an LLM-based speech recognition and translation model developed by Moore Threads. It allows users to transcribe speech into text (ASR) and translate speech into other languages (AST) in an end-to-end manner. The model was trained using 5K hours of data and is now also available with an 80K hours version. MooER is the first LLM-based speech model trained and inferred using domestic GPUs. The repository includes pretrained models, inference code, and a Gradio demo for a better user experience.
thinc
Thinc is a lightweight deep learning library that offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow and MXNet. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models.
1filellm
1filellm is a command-line data aggregation tool designed for LLM ingestion. It aggregates and preprocesses data from various sources into a single text file, facilitating the creation of information-dense prompts for large language models. The tool supports automatic source type detection, handling of multiple file formats, web crawling functionality, integration with Sci-Hub for research paper downloads, text preprocessing, and token count reporting. Users can input local files, directories, GitHub repositories, pull requests, issues, ArXiv papers, YouTube transcripts, web pages, Sci-Hub papers via DOI or PMID. The tool provides uncompressed and compressed text outputs, with the uncompressed text automatically copied to the clipboard for easy pasting into LLMs.
qserve
QServe is a serving system designed for efficient and accurate Large Language Models (LLM) on GPUs with W4A8KV4 quantization. It achieves higher throughput compared to leading industry solutions, allowing users to achieve A100-level throughput on cheaper L40S GPUs. The system introduces the QoQ quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache, addressing runtime overhead challenges. QServe improves serving throughput for various LLM models by implementing compute-aware weight reordering, register-level parallelism, and fused attention memory-bound techniques.
Consistency_LLM
Consistency Large Language Models (CLLMs) is a family of efficient parallel decoders that reduce inference latency by efficiently decoding multiple tokens in parallel. The models are trained to perform efficient Jacobi decoding, mapping any randomly initialized token sequence to the same result as auto-regressive decoding in as few steps as possible. CLLMs have shown significant improvements in generation speed on various tasks, achieving up to 3.4 times faster generation. The tool provides a seamless integration with other techniques for efficient Large Language Model (LLM) inference, without the need for draft models or architectural modifications.
Qwen
Qwen is a series of large language models developed by Alibaba DAMO Academy. It outperforms the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the modelsβ capabilities on natural language understanding, mathematic problem solving, coding, etc. Qwen models outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the modelsβ capabilities on natural language understanding, mathematic problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.
mllm
mllm is a fast and lightweight multimodal LLM inference engine for mobile and edge devices. It is a Plain C/C++ implementation without dependencies, optimized for multimodal LLMs like fuyu-8B, and supports ARM NEON and x86 AVX2. The engine offers 4-bit and 6-bit integer quantization, making it suitable for intelligent personal agents, text-based image searching/retrieval, screen VQA, and various mobile applications without compromising user privacy.
AiOS
AiOS is a tool for human pose and shape estimation, performing human localization and SMPL-X estimation in a progressive manner. It consists of body localization, body refinement, and whole-body refinement stages. Users can download datasets for evaluation, SMPL-X body models, and AiOS checkpoint. Installation involves creating a conda virtual environment, installing PyTorch, torchvision, Pytorch3D, MMCV, and other dependencies. Inference requires placing the video for inference and pretrained models in specific directories. Test results are provided for NMVE, NMJE, MVE, and MPJPE on datasets like BEDLAM and AGORA. Users can run scripts for AGORA validation, AGORA test leaderboard, and BEDLAM leaderboard. The tool acknowledges codes from MMHuman3D, ED-Pose, and SMPLer-X.
For similar tasks
clearml-fractional-gpu
ClearML Fractional GPU is a tool designed to optimize GPU resource utilization by allowing multiple containers to run on the same GPU with driver-level memory limitation and compute time-slicing. It supports CUDA 11.x & CUDA 12.x, preventing greedy processes from grabbing the entire GPU memory. The tool offers options like Dynamic GPU Slicing, Container-based Memory Limits, and Kubernetes-based Static MIG Slicing to enhance hardware utilization and workload performance for AI development.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.