VideoTuna
Let's finetune video generation models!
Stars: 330
VideoTuna is a codebase for text-to-video applications that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation. It provides comprehensive pipelines in video generation, including pre-training, continuous training, post-training, and fine-tuning. The models in VideoTuna include U-Net and DiT architectures for visual generation tasks, with upcoming releases of a new 3D video VAE and a controllable facial video generation model.
README:
π€π€π€ Videotuna is a useful codebase for text-to-video applications.
π VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, text-to-image generation (to the best of our knowledge).
π VideoTuna is the first repo that provides comprehensive pipelines in video generation, including pre-training, continuous training, post-training (alignment), and fine-tuning (to the best of our knowledge).
π The models of VideoTuna include both U-Net and DiT architectures for visual generation tasks.
π A new 3D video VAE, and a controllable facial video generation model will be released soon.
π All-in-one framework: Inference and fine-tune up-to-date video generation models.
π Pre-training: Build your own foundational text-to-video model.
π Continuous training: Keep improving your model with new data.
π Domain-specific fine-tuning: Adapt models to your specific scenario.
π Concept-specific fine-tuning: Teach your models with unique concepts.
π Enhanced language understanding: Improve model comprehension through continuous training.
π Post-processing: Enhance the videos with video-to-video enhancement model.
π Post-training/Human preference alignment: Post-training with RLHF for more attractive results.
- [2024-11-01] We make the VideoTuna V0.1.0 public!
The 3D video VAE from VideoTuna can accurately compress and reconstruct the input videos with fine details.
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Ground Truth | Reconstruction |
Input 1 | Input 2 | Input 3 |
Emotion: Anger | Emotion: Disgust | Emotion: Fear |
Emotion: Happy | Emotion: Sad | Emotion: Surprise |
Emotion: Anger | Emotion: Disgust | Emotion: Fear |
Emotion: Happy | Emotion: Sad | Emotion: Surprise |
Emotion: Anger | Emotion: Disgust | Emotion: Fear |
Emotion: Happy | Emotion: Sad | Emotion: Surprise |
- [ ] More demo and applications
- [ ] More functionalities such as control modules. (Suggestions are welcome!)
VideoTuna/
βββ assets # put images for readme
βββ checkpoints # put model checkpoints here
βββ configs # model and experimental configs
βββ data # data processing scripts and dataset files
βββ docs # documentations
βββ eval # evaluation scripts
βββ inputs # input examples for testing
βββ scripts # train and inference python scripts
βββ shsripts # train and inference shell scripts
βββ src # model-related source code
βββ tests # testing scripts
βββ tools # some tool scripts
T2V-Models | HxWxL | Checkpoints |
---|---|---|
HunyuanVideo | 720x1280x129 | Hugging Face |
Mochi | 848x480, 3s | Hugging Face |
CogVideoX-2B | 480x720x49 | Hugging Face |
CogVideoX-5B | 480x720x49 | Hugging Face |
Open-Sora 1.0 | 512Γ512x16 | Hugging Face |
Open-Sora 1.0 | 256Γ256x16 | Hugging Face |
Open-Sora 1.0 | 256Γ256x16 | Hugging Face |
VideoCrafter2 | 320x512x16 | Hugging Face |
VideoCrafter1 | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
I2V-Models | HxWxL | Checkpoints |
---|---|---|
CogVideoX-5B-I2V | 480x720x49 | Hugging Face |
DynamiCrafter | 576x1024x16 | Hugging Face |
VideoCrafter1 | 320x512x16 | Hugging Face |
- Note: H: height; W: width; L: length
Please check docs/CHECKPOINTS.md to download all the model checkpoints.
conda create --name videotuna python=3.10 -y
conda activate videotuna
pip install -r requirements.txt
git clone https://github.com/JingyeChen/SwissArmyTransformer
pip install -e SwissArmyTransformer/
git clone https://github.com/tgxs002/HPSv2.git
cd ./HPSv2
pip install -e .
cd ..
conda config --add channels conda-forge
conda install ffmpeg
Please follow docs/CHECKPOINTS.md to download model checkpoints.
After downloading, the model checkpoints should be placed as Checkpoint Structure.
- Inference a set of text-to-video models in one command:
bash tools/video_comparison/compare.sh
- The default mode is to run all models, e.g.,
inference_methods="videocrafter2;dynamicrafter;cogvideoβt2v;cogvideoβi2v;opensora"
- If the users want to inference specific models, modify the
inference_methods
variable incompare.sh
, and list the desired models separated by semicolons. - Also specify the input directory via the
input_dir
variable. This directory should contain aprompts.txt
file, where each line corresponds to a prompt for the video generation. The defaultinput_dir
isinputs/t2v
- The default mode is to run all models, e.g.,
- Inference a set of image-to-video models in one command:
bash tools/video_comparison/compare_i2v.sh
- Inference a specific model, run the corresponding commands as follows:
Task | Model | Command | Length (#frames) | Resolution | Inference Time (s) | GPU Memory (GiB) |
---|---|---|---|---|---|---|
T2V | HunyuanVideo | bash shscripts/inference_hunyuan_diffusers.sh |
129 | 720x1280 | 1920 | 59.15 |
T2V | Mochi | bash shscripts/inference_mochi.sh |
84 | 480x848 | 109.0 | 26 |
I2V | CogVideoX-5b-I2V | bash shscripts/inference_cogVideo_i2v_diffusers.sh |
49 | 480x720 | 310.4 | 4.78 |
T2V | CogVideoX-2b | bash shscripts/inference_cogVideo_t2v_diffusers.sh |
49 | 480x720 | 107.6 | 2.32 |
T2V | Open Sora V1.0 | bash shscripts/inference_opensora_v10_16x256x256.sh |
16 | 256x256 | 11.2 | 23.99 |
T2V | VideoCrafter-V2-320x512 | bash shscripts/inference_vc2_t2v_320x512.sh |
16 | 320x512 | 26.4 | 10.03 |
T2V | VideoCrafter-V1-576x1024 | bash shscripts/inference_vc1_t2v_576x1024.sh |
16 | 576x1024 | 91.4 | 14.57 |
I2V | DynamiCrafter | bash shscripts/inference_dc_i2v_576x1024.sh |
16 | 576x1024 | 101.7 | 52.23 |
I2V | VideoCrafter-V1 | bash shscripts/inference_vc1_i2v_320x512.sh |
16 | 320x512 | 26.4 | 10.03 |
T2I | Flux-dev | bash shscripts/inference_flux.sh |
1 | 768x1360 | 238.1 | 1.18 |
T2I | Flux-schnell | bash shscripts/inference_flux.sh |
1 | 768x1360 | 5.4 | 1.20 |
Please follow the docs/datasets.md to try provided toydataset or build your own datasets.
Before started, we assume you have finished the following two preliminary steps:
- Install the environment
- Prepare the dataset
- Download the checkpoints and get these two checkpoints
ll checkpoints/videocrafter/t2v_v2_512/model.ckpt
ll checkpoints/stablediffusion/v2-1_512-ema/model.ckpt
First, run this command to convert the VC2 checkpoint as we make minor modifications on the keys of the state dict of the checkpoint. The converted checkpoint will be automatically save at checkpoints/videocrafter/t2v_v2_512/model_converted.ckpt
.
python tools/convert_checkpoint.py --input_path checkpoints/videocrafter/t2v_v2_512/model.ckpt
Second, run this command to start training on the single GPU. The training results will be automatically saved at results/train/${CURRENT_TIME}_${EXPNAME}
bash shscripts/train_videocrafter_v2.sh
We support lora finetuning to make the model to learn new concepts/characters/styles.
- Example config file:
configs/001_videocrafter2/vc2_t2v_lora.yaml
- Training lora based on VideoCrafter2:
bash shscripts/train_videocrafter_lora.sh
- Inference the trained models:
bash shscripts/inference_vc2_t2v_320x512_lora.sh
We support open-sora finetuning, you can simply run the following commands:
# finetune the Open-Sora v1.0
bash shscripts/train_opensorav10.sh
We support VBench evaluation to evaluate the T2V generation performance. Please check eval/README.md for details.
We thank the following repos for sharing their awesome models and codes!
- Mochi: A new SOTA in open-source video generation models
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
- LLMs-Meet-MM-Generation: A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Free method for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Free method for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
Please follow CC-BY-NC-ND. If you want a license authorization, please contact the project leads Yingqing He ([email protected]) and Yazhou Xing ([email protected]).
@software{videotuna,
author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
month = {Nov},
year = {2024},
url = {https://github.com/VideoVerses/VideoTuna}
}
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for VideoTuna
Similar Open Source Tools
VideoTuna
VideoTuna is a codebase for text-to-video applications that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation. It provides comprehensive pipelines in video generation, including pre-training, continuous training, post-training, and fine-tuning. The models in VideoTuna include U-Net and DiT architectures for visual generation tasks, with upcoming releases of a new 3D video VAE and a controllable facial video generation model.
dom-to-semantic-markdown
DOM to Semantic Markdown is a tool that converts HTML DOM to Semantic Markdown for use in Large Language Models (LLMs). It maximizes semantic information, token efficiency, and preserves metadata to enhance LLMs' processing capabilities. The tool captures rich web content structure, including semantic tags, image metadata, table structures, and link destinations. It offers customizable conversion options and supports both browser and Node.js environments.
ColossalAI
Colossal-AI is a deep learning system for large-scale parallel training. It provides a unified interface to scale sequential code of model training to distributed environments. Colossal-AI supports parallel training methods such as data, pipeline, tensor, and sequence parallelism and is integrated with heterogeneous training and zero redundancy optimizer.
pytorch-grad-cam
This repository provides advanced AI explainability for PyTorch, offering state-of-the-art methods for Explainable AI in computer vision. It includes a comprehensive collection of Pixel Attribution methods for various tasks like Classification, Object Detection, Semantic Segmentation, and more. The package supports high performance with full batch image support and includes metrics for evaluating and tuning explanations. Users can visualize and interpret model predictions, making it suitable for both production and model development scenarios.
rig
Rig is a Rust library designed for building scalable, modular, and user-friendly applications powered by large language models (LLMs). It provides full support for LLM completion and embedding workflows, offers simple yet powerful abstractions for LLM providers like OpenAI and Cohere, as well as vector stores such as MongoDB and in-memory storage. With Rig, users can easily integrate LLMs into their applications with minimal boilerplate code.
PromptClip
PromptClip is a tool that allows developers to create video clips using LLM prompts. Users can upload videos from various sources, prompt the video in natural language, use different LLM models, instantly watch the generated clips, finetune the clips, and add music or image overlays. The tool provides a seamless way to extract specific moments from videos based on user queries, making video editing and content creation more efficient and intuitive.
MiniCPM-V
MiniCPM-V is a series of end-side multimodal LLMs designed for vision-language understanding. The models take image and text inputs to provide high-quality text outputs. The series includes models like MiniCPM-Llama3-V 2.5 with 8B parameters surpassing proprietary models, and MiniCPM-V 2.0, a lighter model with 2B parameters. The models support over 30 languages, efficient deployment on end-side devices, and have strong OCR capabilities. They achieve state-of-the-art performance on various benchmarks and prevent hallucinations in text generation. The models can process high-resolution images efficiently and support multilingual capabilities.
FlagEmbedding
FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: * **Long-Context LLM** : Activation Beacon * **Fine-tuning of LM** : LM-Cocktail * **Embedding Model** : Visualized-BGE, BGE-M3, LLM Embedder, BGE Embedding * **Reranker Model** : llm rerankers, BGE Reranker * **Benchmark** : C-MTEB
aichat
Aichat is an AI-powered CLI chat and copilot tool that seamlessly integrates with over 10 leading AI platforms, providing a powerful combination of chat-based interaction, context-aware conversations, and AI-assisted shell capabilities, all within a customizable and user-friendly environment.
Qmedia
QMedia is an open-source multimedia AI content search engine designed specifically for content creators. It provides rich information extraction methods for text, image, and short video content. The tool integrates unstructured text, image, and short video information to build a multimodal RAG content Q&A system. Users can efficiently search for image/text and short video materials, analyze content, provide content sources, and generate customized search results based on user interests and needs. QMedia supports local deployment for offline content search and Q&A for private data. The tool offers features like content cards display, multimodal content RAG search, and pure local multimodal models deployment. Users can deploy different types of models locally, manage language models, feature embedding models, image models, and video models. QMedia aims to spark new ideas for content creation and share AI content creation concepts in an open-source manner.
amica
Amica is an application that allows you to easily converse with 3D characters in your browser. You can import VRM files, adjust the voice to fit the character, and generate response text that includes emotional expressions.
auto-news
Auto-News is an automatic news aggregator tool that utilizes Large Language Models (LLM) to pull information from various sources such as Tweets, RSS feeds, YouTube videos, web articles, Reddit, and journal notes. The tool aims to help users efficiently read and filter content based on personal interests, providing a unified reading experience and organizing information effectively. It features feed aggregation with summarization, transcript generation for videos and articles, noise reduction, task organization, and deep dive topic exploration. The tool supports multiple LLM backends, offers weekly top-k aggregations, and can be deployed on Linux/MacOS using docker-compose or Kubernetes.
MaxKB
MaxKB is a knowledge base Q&A system based on the LLM large language model. MaxKB = Max Knowledge Base, which aims to become the most powerful brain of the enterprise.
TempCompass
TempCompass is a benchmark designed to evaluate the temporal perception ability of Video LLMs. It encompasses a diverse set of temporal aspects and task formats to comprehensively assess the capability of Video LLMs in understanding videos. The benchmark includes conflicting videos to prevent models from relying on single-frame bias and language priors. Users can clone the repository, install required packages, prepare data, run inference using examples like Video-LLaVA and Gemini, and evaluate the performance of their models across different tasks such as Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation.
superagentx
SuperAgentX is a lightweight open-source AI framework designed for multi-agent applications with Artificial General Intelligence (AGI) capabilities. It offers goal-oriented multi-agents with retry mechanisms, easy deployment through WebSocket, RESTful API, and IO console interfaces, streamlined architecture with no major dependencies, contextual memory using SQL + Vector databases, flexible LLM configuration supporting various Gen AI models, and extendable handlers for integration with diverse APIs and data sources. It aims to accelerate the development of AGI by providing a powerful platform for building autonomous AI agents capable of executing complex tasks with minimal human intervention.
For similar tasks
VideoTuna
VideoTuna is a codebase for text-to-video applications that integrates multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation. It provides comprehensive pipelines in video generation, including pre-training, continuous training, post-training, and fine-tuning. The models in VideoTuna include U-Net and DiT architectures for visual generation tasks, with upcoming releases of a new 3D video VAE and a controllable facial video generation model.
lunary
Lunary is an open-source observability and prompt platform for Large Language Models (LLMs). It provides a suite of features to help AI developers take their applications into production, including analytics, monitoring, prompt templates, fine-tuning dataset creation, chat and feedback tracking, and evaluations. Lunary is designed to be usable with any model, not just OpenAI, and is easy to integrate and self-host.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.