CogVideo
text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Stars: 10269
CogVideo is an open-source repository that provides pretrained text-to-video models for generating videos based on input text. It includes models like CogVideoX-2B and CogVideo, offering powerful video generation capabilities. The repository offers tools for inference, fine-tuning, and model conversion, along with demos showcasing the model's capabilities through CLI, web UI, and online experiences. CogVideo aims to facilitate the creation of high-quality videos from textual descriptions, catering to a wide range of applications.
README:
Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space
📚 View the paper and user guide
📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
- 🔥🔥 News:
2025/01/08
: We have updated the code forLora
fine-tuning based on thediffusers
version model, which uses less GPU memory. For more details, please see here. - 🔥 News:
2024/11/15
: We released theCogVideoX1.5
model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code. - 🔥 News:
2024/11/08
: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX. The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code here. - 🔥 News:
2024/10/13
: A more cost-effective fine-tuning framework forCogVideoX-5B
that works with a single 4090 GPU, cogvideox-factory, has been released. It supports fine-tuning with multiple resolutions. Feel free to use it! - 🔥 News:
2024/10/10
: We have updated our technical report. Please click here to view it. More training details and a demo have been added. To see the demo, click here.- 🔥 News:2024/10/09
: We have publicly released the technical documentation for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced. - 🔥 News:
2024/9/19
: We have open-sourced the CogVideoX series image-to-video model CogVideoX-5B-I2V. This model can take an image as a background input and generate a video combined with prompt words, offering greater controllability. With this, the CogVideoX series models now support three tasks: text-to-video generation, video continuation, and image-to-video generation. Welcome to try it online at Experience. - 🔥
2024/9/19
: The Caption model CogVLM2-Caption, used in the training process of CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it. - 🔥
2024/8/27
: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B. We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run CogVideoX-2B on older GPUs likeGTX 1080TI
, and CogVideoX-5B on desktop GPUs likeRTX 3060
. Please strictly follow the requirements to update and install dependencies, and refer to cli_demo for inference code. Additionally, the open-source license for the CogVideoX-2B model has been changed to the Apache 2.0 License. - 🔥
2024/8/6
: We have open-sourced 3D Causal VAE, used for CogVideoX-2B, which can reconstruct videos with almost no loss. - 🔥
2024/8/6
: We have open-sourced the first model of the CogVideoX series video generation models, **CogVideoX-2B **. - 🌱 Source:
2022/5/19
: We have open-sourced the CogVideo video generation model (now you can see it in theCogVideo
branch). This is the first open-source large Transformer-based text-to-video generation model. You can access the ICLR'23 paper for technical details.
Jump to a specific section:
- Quick Start
- Gallery
- Model Introduction
- Friendly Links
- Project Structure
- CogVideo(ICLR'23)
- Citation
- Model-License
Before running the model, please refer to this guide to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the model. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.
Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.
Follow instructions in sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.
Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.
pip install -r requirements.txt
Then follow diffusers_demo: A more detailed explanation of the inference code, mentioning the significance of common parameters.
For more details on quantized inference, please refer to diffusers-torchao. With Diffusers and TorchAO, quantized inference is also possible leading to memory-efficient inference as well as speedup in some cases when compiled. A full list of memory and time benchmarks with various settings on A100 and H100 has been published at diffusers-torchao.
To view the corresponding prompt words for the gallery, please click here
CogVideoX is an open-source version of the video generation model originating from QingYing. The table below displays the list of video generation models we currently offer, along with their foundational information.
Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
---|---|---|---|---|---|
Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
Video Resolution | 1360 * 768 | Min(W, H) = 768 768 ≤ Max(W, H) ≤ 1360 Max(W, H) % 16 = 0 |
720 * 480 | ||
Inference Precision | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 | FP16*(Recommended), BF16, FP32, FP8*, INT8, Not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 | ||
Single GPU Memory Usage |
SAT BF16: 76GB diffusers BF16: from 10GB* diffusers INT8(torchao): from 7GB* |
SAT FP16: 18GB diffusers FP16: 4GB minimum* diffusers INT8 (torchao): 3.6GB minimum* |
SAT BF16: 26GB diffusers BF16 : 5GB minimum* diffusers INT8 (torchao): 4.4GB minimum* |
||
Multi-GPU Memory Usage |
BF16: 24GB* using diffusers |
FP16: 10GB* using diffusers |
BF16: 15GB* using diffusers |
||
Inference Speed (Step = 50, FP/BF16) |
Single A100: ~1000 seconds (5-second video) Single H100: ~550 seconds (5-second video) |
Single A100: ~90 seconds Single H100: ~45 seconds |
Single A100: ~180 seconds Single H100: ~90 seconds |
||
Prompt Language | English* | ||||
Prompt Token Limit | 224 Tokens | 226 Tokens | |||
Video Length | 5 seconds or 10 seconds | 6 seconds | |||
Frame Rate | 16 frames / second | 8 frames / second | |||
Position Encoding | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed | |
Download Link (Diffusers) |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
Download Link (SAT) |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
SAT |
Data Explanation
- While testing using the diffusers library, all optimizations included in the diffusers library were enabled. This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures. Generally, this scheme can be adapted to all NVIDIA Ampere architecture and above devices. If optimizations are disabled, memory consumption will multiply, with peak memory usage being about 3 times the value in the table. However, speed will increase by about 3-4 times. You can selectively disable some optimizations, including:
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
- For multi-GPU inference, the
enable_sequential_cpu_offload()
optimization needs to be disabled. - Using INT8 models will slow down inference, which is done to accommodate lower-memory GPUs while maintaining minimal video quality loss, though inference speed will significantly decrease.
- The CogVideoX-2B model was trained in
FP16
precision, and all CogVideoX-5B models were trained inBF16
precision. We recommend using the precision in which the model was trained for inference. -
PytorchAO and Optimum-quanto can be
used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
allows the model to run on free T4 Colabs or GPUs with smaller memory! Also, note that TorchAO quantization is fully
compatible with
torch.compile
, which can significantly improve inference speed. FP8 precision must be used on devices with NVIDIA H100 and above, requiring source installation oftorch
,torchao
Python packages. CUDA 12.4 is recommended. - The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed
increases by about 10%. Only the
diffusers
version of the model supports quantization. - The model only supports English input; other languages can be translated into English for use via large model refinement.
We highly welcome contributions from the community and actively contribute to the open-source community. The following works have already been adapted for CogVideoX, and we invite everyone to use them:
- CogVideoX-Fun: CogVideoX-Fun is a modified pipeline based on the CogVideoX architecture, supporting flexible resolutions and multiple launch methods.
- CogStudio: A separate repository for CogVideo's Gradio Web UI, which supports more functional Web UIs.
- Xorbits Inference: A powerful and comprehensive distributed inference framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one click.
- ComfyUI-CogVideoXWrapper Use the ComfyUI framework to integrate CogVideoX into your workflow.
- VideoSys: VideoSys provides a user-friendly, high-performance infrastructure for video generation, with full pipeline support and continuous integration of the latest models and techniques.
- AutoDL Space: A one-click deployment Huggingface Space image provided by community members.
- Interior Design Fine-Tuning Model: is a fine-tuned model based on CogVideoX, specifically designed for interior design.
-
xDiT: xDiT is a scalable inference engine for Diffusion Transformers (DiTs)
on multiple GPU Clusters. xDiT supports real-time image and video generations services.
cogvideox-factory: A cost-effective
fine-tuning framework for CogVideoX, compatible with the
diffusers
version model. Supports more resolutions, and fine-tuning CogVideoX-5B can be done with a single 4090 GPU. - CogVideoX-Interpolation: A pipeline based on the modified CogVideoX structure, aimed at providing greater flexibility for keyframe interpolation generation.
- DiffSynth-Studio: DiffSynth Studio is a diffusion engine. It has restructured the architecture, including text encoders, UNet, VAE, etc., enhancing computational performance while maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
- CogVideoX-Controlnet: A simple ControlNet module code that includes the CogVideoX model.
- VideoTuna: VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, text-to-image generation.
- ConsisID: An identity-preserving text-to-video generation model, bases on CogVideoX-5B, which keep the face consistent in the generated video by frequency decomposition.
- A Step by Step Tutorial: A step-by-step guide on installing and optimizing the CogVideoX1.5-5B-I2V model in Windows and cloud environments. Special thanks to the FurkanGozukara for his effort and support!
This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the CogVideoX open-source model.
Here provide three projects that can be run directly on free Colab T4 instances:
- CogVideoX-5B-T2V-Colab.ipynb: CogVideoX-5B Text-to-Video Colab code.
- CogVideoX-5B-T2V-Int8-Colab.ipynb: CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
- CogVideoX-5B-I2V-Colab.ipynb: CogVideoX-5B Image-to-Video Colab code.
- CogVideoX-5B-V2V-Colab.ipynb: CogVideoX-5B Video-to-Video Colab code.
- dcli_demo: A more detailed inference code explanation, including the significance of common parameters. All of this is covered here.
- cli_demo_quantization: Quantized model inference code that can run on devices with lower memory. You can also modify this code to support running CogVideoX models in FP8 precision.
- diffusers_vae_demo: Code for running VAE inference separately.
- space demo: The same GUI code as used in the Huggingface Space, with frame interpolation and super-resolution tools integrated.
- convert_demo: How to convert user input into long-form input suitable for CogVideoX. Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data using an LLM. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language model.
- gradio_web_demo: A simple Gradio web application demonstrating how to use the CogVideoX-2B / 5B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web application for video generation.
- finetune_demo: Fine-tuning scheme and details of the diffusers version of the CogVideoX model.
- sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.
This folder contains some tools for model conversion / caption generation, etc.
- convert_weight_sat2hf: Converts SAT model weights to Huggingface model weights.
- caption_demo: Caption tool, a model that understands videos and outputs descriptions in text.
- export_sat_lora_weight: SAT fine-tuning model export tool, exports the SAT Lora Adapter in diffusers format.
- load_cogvideox_lora: Tool code for loading the diffusers version of fine-tuned Lora Adapter.
- llm_flux_cogvideox: Automatically generate videos using an open-source local large language model + Flux + CogVideoX.
- parallel_inference_xdit: Supported by xDiT, parallelize the video generation process on multiple GPUs.
The official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch
CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.
The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.
🌟 If you find our work helpful, please leave us a star and cite our paper.
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
journal={arXiv preprint arXiv:2205.15868},
year={2022}
}
We welcome your contributions! You can click here for more information.
The code in this repository is released under the Apache 2.0 License.
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.
The CogVideoX-5B model (Transformers module, include I2V and T2V) is released under the CogVideoX LICENSE.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for CogVideo
Similar Open Source Tools
CogVideo
CogVideo is an open-source repository that provides pretrained text-to-video models for generating videos based on input text. It includes models like CogVideoX-2B and CogVideo, offering powerful video generation capabilities. The repository offers tools for inference, fine-tuning, and model conversion, along with demos showcasing the model's capabilities through CLI, web UI, and online experiences. CogVideo aims to facilitate the creation of high-quality videos from textual descriptions, catering to a wide range of applications.
postgresml
PostgresML is a powerful Postgres extension that seamlessly combines data storage and machine learning inference within your database. It enables running machine learning and AI operations directly within PostgreSQL, leveraging GPU acceleration for faster computations, integrating state-of-the-art large language models, providing built-in functions for text processing, enabling efficient similarity search, offering diverse ML algorithms, ensuring high performance, scalability, and security, supporting a wide range of NLP tasks, and seamlessly integrating with existing PostgreSQL tools and client libraries.
Controllable-RAG-Agent
This repository contains a sophisticated deterministic graph-based solution for answering complex questions using a controllable autonomous agent. The solution is designed to ensure that answers are solely based on the provided data, avoiding hallucinations. It involves various steps such as PDF loading, text preprocessing, summarization, database creation, encoding, and utilizing large language models. The algorithm follows a detailed workflow involving planning, retrieval, answering, replanning, content distillation, and performance evaluation. Heuristics and techniques implemented focus on content encoding, anonymizing questions, task breakdown, content distillation, chain of thought answering, verification, and model performance evaluation.
LLM-Zero-to-Hundred
LLM-Zero-to-Hundred is a repository showcasing various applications of LLM chatbots and providing insights into training and fine-tuning Language Models. It includes projects like WebGPT, RAG-GPT, WebRAGQuery, LLM Full Finetuning, RAG-Master LLamaindex vs Langchain, open-source-RAG-GEMMA, and HUMAIN: Advanced Multimodal, Multitask Chatbot. The projects cover features like ChatGPT-like interaction, RAG capabilities, image generation and understanding, DuckDuckGo integration, summarization, text and voice interaction, and memory access. Tutorials include LLM Function Calling and Visualizing Text Vectorization. The projects have a general structure with folders for README, HELPER, .env, configs, data, src, images, and utils.
Whisper-TikTok
Discover Whisper-TikTok, an innovative AI-powered tool that leverages the prowess of Edge TTS, OpenAI-Whisper, and FFMPEG to craft captivating TikTok videos. Whisper-TikTok effortlessly generates accurate transcriptions from audio files and integrates Microsoft Edge Cloud Text-to-Speech API for vibrant voiceovers. The program orchestrates the synthesis of videos using a structured JSON dataset, generating mesmerizing TikTok content in minutes.
premsql
PremSQL is an open-source library designed to help developers create secure, fully local Text-to-SQL solutions using small language models. It provides essential tools for building and deploying end-to-end Text-to-SQL pipelines with customizable components, ideal for secure, autonomous AI-powered data analysis. The library offers features like Local-First approach, Customizable Datasets, Robust Executors and Evaluators, Advanced Generators, Error Handling and Self-Correction, Fine-Tuning Support, and End-to-End Pipelines. Users can fine-tune models, generate SQL queries from natural language inputs, handle errors, and evaluate model performance against predefined metrics. PremSQL is extendible for customization and private data usage.
qdrant
Qdrant is a vector similarity search engine and vector database. It is written in Rust, which makes it fast and reliable even under high load. Qdrant can be used for a variety of applications, including: * Semantic search * Image search * Product recommendations * Chatbots * Anomaly detection Qdrant offers a variety of features, including: * Payload storage and filtering * Hybrid search with sparse vectors * Vector quantization and on-disk storage * Distributed deployment * Highlighted features such as query planning, payload indexes, SIMD hardware acceleration, async I/O, and write-ahead logging Qdrant is available as a fully managed cloud service or as an open-source software that can be deployed on-premises.
XLearning
XLearning is a scheduling platform for big data and artificial intelligence, supporting various machine learning and deep learning frameworks. It runs on Hadoop Yarn and integrates frameworks like TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost. XLearning offers scalability, compatibility, multiple deep learning framework support, unified data management based on HDFS, visualization display, and compatibility with code at native frameworks. It provides functions for data input/output strategies, container management, TensorBoard service, and resource usage metrics display. XLearning requires JDK >= 1.7 and Maven >= 3.3 for compilation, and deployment on CentOS 7.2 with Java >= 1.7 and Hadoop 2.6, 2.7, 2.8.
UFO
UFO is a UI-focused dual-agent framework to fulfill user requests on Windows OS by seamlessly navigating and operating within individual or spanning multiple applications.
fuse-med-ml
FuseMedML is a Python framework designed to accelerate machine learning-based discovery in the medical field by promoting code reuse. It provides a flexible design concept where data is stored in a nested dictionary, allowing easy handling of multi-modality information. The framework includes components for creating custom models, loss functions, metrics, and data processing operators. Additionally, FuseMedML offers 'batteries included' key components such as fuse.data for data processing, fuse.eval for model evaluation, and fuse.dl for reusable deep learning components. It supports PyTorch and PyTorch Lightning libraries and encourages the creation of domain extensions for specific medical domains.
gpt4all
GPT4All is an ecosystem to run powerful and customized large language models that work locally on consumer grade CPUs and any GPU. Note that your CPU needs to support AVX or AVX2 instructions. Learn more in the documentation. A GPT4All model is a 3GB - 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to enforce quality and security alongside spearheading the effort to allow any person or enterprise to easily train and deploy their own on-edge large language models.
tinyllm
tinyllm is a lightweight framework designed for developing, debugging, and monitoring LLM and Agent powered applications at scale. It aims to simplify code while enabling users to create complex agents or LLM workflows in production. The core classes, Function and FunctionStream, standardize and control LLM, ToolStore, and relevant calls for scalable production use. It offers structured handling of function execution, including input/output validation, error handling, evaluation, and more, all while maintaining code readability. Users can create chains with prompts, LLM models, and evaluators in a single file without the need for extensive class definitions or spaghetti code. Additionally, tinyllm integrates with various libraries like Langfuse and provides tools for prompt engineering, observability, logging, and finite state machine design.
Neurite
Neurite is an innovative project that combines chaos theory and graph theory to create a digital interface that explores hidden patterns and connections for creative thinking. It offers a unique workspace blending fractals with mind mapping techniques, allowing users to navigate the Mandelbrot set in real-time. Nodes in Neurite represent various content types like text, images, videos, code, and AI agents, enabling users to create personalized microcosms of thoughts and inspirations. The tool supports synchronized knowledge management through bi-directional synchronization between mind-mapping and text-based hyperlinking. Neurite also features FractalGPT for modular conversation with AI, local AI capabilities for multi-agent chat networks, and a Neural API for executing code and sequencing animations. The project is actively developed with plans for deeper fractal zoom, advanced control over node placement, and experimental features.
OpenAdapt
OpenAdapt is an open-source software adapter between Large Multimodal Models (LMMs) and traditional desktop and web Graphical User Interfaces (GUIs). It aims to automate repetitive GUI workflows by leveraging the power of LMMs. OpenAdapt records user input and screenshots, converts them into tokenized format, and generates synthetic input via transformer model completions. It also analyzes recordings to generate task trees and replay synthetic input to complete tasks. OpenAdapt is model agnostic and generates prompts automatically by learning from human demonstration, ensuring that agents are grounded in existing processes and mitigating hallucinations. It works with all types of desktop GUIs, including virtualized and web, and is open source under the MIT license.
carla
CARLA is an open-source simulator for autonomous driving research. It provides open-source code, protocols, and digital assets (urban layouts, buildings, vehicles) for developing, training, and validating autonomous driving systems. CARLA supports flexible specification of sensor suites and environmental conditions.
aistore
AIStore is a lightweight object storage system designed for AI applications. It is highly scalable, reliable, and easy to use. AIStore can be deployed on any commodity hardware, and it can be used to store and manage large datasets for deep learning and other AI applications.
For similar tasks
CogVideo
CogVideo is an open-source repository that provides pretrained text-to-video models for generating videos based on input text. It includes models like CogVideoX-2B and CogVideo, offering powerful video generation capabilities. The repository offers tools for inference, fine-tuning, and model conversion, along with demos showcasing the model's capabilities through CLI, web UI, and online experiences. CogVideo aims to facilitate the creation of high-quality videos from textual descriptions, catering to a wide range of applications.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.