CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

Stars: 11017

Visit

CogVideo is an open-source repository that provides pretrained text-to-video models for generating videos based on input text. It includes models like CogVideoX-2B and CogVideo, offering powerful video generation capabilities. The repository offers tools for inference, fine-tuning, and model conversion, along with demos showcasing the model's capabilities through CLI, web UI, and online experiences. CogVideo aims to facilitate the creation of high-quality videos from textual descriptions, catering to a wide range of applications.

README:

CogVideo & CogVideoX

中文阅读

日本語で読む

Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space

📚 View the paper and user guide

👋 Join our WeChat and Discord

📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.

Project Updates

🔥 News: 2025/02/28: DDIM Inverse is now supported in CogVideoX-5B and CogVideoX1.5-5B. Check here.
🔥 News: 2025/01/08: We have updated the code for Lora fine-tuning based on the diffusers version model, which uses less GPU memory. For more details, please see here.
🔥 News: 2024/11/15: We released the CogVideoX1.5 model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
🔥 News: 2024/11/08: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX. The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code here.
🔥 News: 2024/10/13: A more cost-effective fine-tuning framework for CogVideoX-5B that works with a single 4090 GPU, cogvideox-factory, has been released. It supports fine-tuning with multiple resolutions. Feel free to use it!
🔥 News: 2024/10/10: We have updated our technical report. Please click here to view it. More training details and a demo have been added. To see the demo, click here.- 🔥 News: 2024/10/09: We have publicly released the technical documentation for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.
🔥 News: 2024/9/19: We have open-sourced the CogVideoX series image-to-video model CogVideoX-5B-I2V. This model can take an image as a background input and generate a video combined with prompt words, offering greater controllability. With this, the CogVideoX series models now support three tasks: text-to-video generation, video continuation, and image-to-video generation. Welcome to try it online at Experience.
🔥 2024/9/19: The Caption model CogVLM2-Caption, used in the training process of CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it.
🔥 2024/8/27: We have open-sourced a larger model in the CogVideoX series, CogVideoX-5B. We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run CogVideoX-2B on older GPUs like GTX 1080TI, and CogVideoX-5B on desktop GPUs like RTX 3060. Please strictly follow the requirements to update and install dependencies, and refer to cli_demo for inference code. Additionally, the open-source license for the CogVideoX-2B model has been changed to the Apache 2.0 License.
🔥 2024/8/6: We have open-sourced 3D Causal VAE, used for CogVideoX-2B, which can reconstruct videos with almost no loss.
🔥 2024/8/6: We have open-sourced the first model of the CogVideoX series video generation models, **CogVideoX-2B **.
🌱 Source: 2022/5/19: We have open-sourced the CogVideo video generation model (now you can see it in the CogVideo branch). This is the first open-source large Transformer-based text-to-video generation model. You can access the ICLR'23 paper for technical details.

Jump to a specific section:

Quick Start
Gallery
- CogVideoX-5B
- CogVideoX-2B
Model Introduction
Friendly Links
Project Structure
CogVideo(ICLR'23)
Citation
Model-License

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the model. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.

SAT

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

Follow instructions in sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.

Diffusers

Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.

pip install -r requirements.txt

Then follow diffusers_demo: A more detailed explanation of the inference code, mentioning the significance of common parameters.

For more details on quantized inference, please refer to diffusers-torchao. With Diffusers and TorchAO, quantized inference is also possible leading to memory-efficient inference as well as speedup in some cases when compiled. A full list of memory and time benchmarks with various settings on A100 and H100 has been published at diffusers-torchao.

Gallery

CogVideoX-5B

CogVideoX-2B

To view the corresponding prompt words for the gallery, please click here

Model Introduction

CogVideoX is an open-source version of the video generation model originating from QingYing. The table below displays the list of video generation models we currently offer, along with their foundational information.

Model Name	CogVideoX1.5-5B (Latest)	CogVideoX1.5-5B-I2V (Latest)	CogVideoX-2B	CogVideoX-5B	CogVideoX-5B-I2V
Release Date	November 8, 2024	November 8, 2024	August 6, 2024	August 27, 2024	September 19, 2024
Video Resolution	1360 * 768	Min(W, H) = 768 768 ≤ Max(W, H) ≤ 1360 Max(W, H) % 16 = 0	720 * 480
Number of Frames	Should be 16N + 1 where N <= 10 (default 81)		Should be 8N + 1 where N <= 6 (default 49)
Inference Precision	BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4		*FP16(Recommended)*, BF16, FP32, FP8, INT8, Not supported: INT4	BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4
Single GPU Memory Usage	SAT BF16: 76GB diffusers BF16: from 10GB* diffusers INT8(torchao): from 7GB*		SAT FP16: 18GB diffusers FP16: 4GB minimum* diffusers INT8 (torchao): 3.6GB minimum*	SAT BF16: 26GB diffusers BF16 : 5GB minimum* diffusers INT8 (torchao): 4.4GB minimum*
Multi-GPU Memory Usage	*BF16: 24GB using diffusers**		*FP16: 10GB using diffusers**	*BF16: 15GB using diffusers**
Inference Speed (Step = 50, FP/BF16)	Single A100: ~1000 seconds (5-second video) Single H100: ~550 seconds (5-second video)		Single A100: ~90 seconds Single H100: ~45 seconds	Single A100: ~180 seconds Single H100: ~90 seconds
Prompt Language	English*
Prompt Token Limit	224 Tokens		226 Tokens
Video Length	5 seconds or 10 seconds		6 seconds
Frame Rate	16 frames / second		8 frames / second
Position Encoding	3d_rope_pos_embed		3d_sincos_pos_embed	3d_rope_pos_embed	3d_rope_pos_embed + learnable_pos_embed
Download Link (Diffusers)	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel
Download Link (SAT)	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel		SAT

Data Explanation

While testing using the diffusers library, all optimizations included in the diffusers library were enabled. This scheme has not been tested for actual memory usage on devices outside of NVIDIA A100 / H100 architectures. Generally, this scheme can be adapted to all NVIDIA Ampere architecture and above devices. If optimizations are disabled, memory consumption will multiply, with peak memory usage being about 3 times the value in the table. However, speed will increase by about 3-4 times. You can selectively disable some optimizations, including:

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

For multi-GPU inference, the enable_sequential_cpu_offload() optimization needs to be disabled.
Using INT8 models will slow down inference, which is done to accommodate lower-memory GPUs while maintaining minimal video quality loss, though inference speed will significantly decrease.
The CogVideoX-2B model was trained in FP16 precision, and all CogVideoX-5B models were trained in BF16 precision. We recommend using the precision in which the model was trained for inference.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This allows the model to run on free T4 Colabs or GPUs with smaller memory! Also, note that TorchAO quantization is fully compatible with torch.compile, which can significantly improve inference speed. FP8 precision must be used on devices with NVIDIA H100 and above, requiring source installation of torch, torchao Python packages. CUDA 12.4 is recommended.
The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed increases by about 10%. Only the diffusers version of the model supports quantization.
The model only supports English input; other languages can be translated into English for use via large model refinement.

Friendly Links

We highly welcome contributions from the community and actively contribute to the open-source community. The following works have already been adapted for CogVideoX, and we invite everyone to use them:

RIFLEx-CogVideoX： RIFLEx extends the video with just one line of code: freq[k-1]=(2np.pi)/(Ls). The framework not only supports training-free inference, but also offers models fine-tuned based on CogVideoX. By fine-tuning the model for just 1,000 steps on original-length videos, RIFLEx significantly enhances its length extrapolation capability.
CogVideoX-Fun: CogVideoX-Fun is a modified pipeline based on the CogVideoX architecture, supporting flexible resolutions and multiple launch methods.
CogStudio: A separate repository for CogVideo's Gradio Web UI, which supports more functional Web UIs.
Xorbits Inference: A powerful and comprehensive distributed inference framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one click.
ComfyUI-CogVideoXWrapper Use the ComfyUI framework to integrate CogVideoX into your workflow.
VideoSys: VideoSys provides a user-friendly, high-performance infrastructure for video generation, with full pipeline support and continuous integration of the latest models and techniques.
AutoDL Space: A one-click deployment Huggingface Space image provided by community members.
Interior Design Fine-Tuning Model: is a fine-tuned model based on CogVideoX, specifically designed for interior design.
xDiT: xDiT is a scalable inference engine for Diffusion Transformers (DiTs) on multiple GPU Clusters. xDiT supports real-time image and video generations services. cogvideox-factory: A cost-effective fine-tuning framework for CogVideoX, compatible with the diffusers version model. Supports more resolutions, and fine-tuning CogVideoX-5B can be done with a single 4090 GPU.
CogVideoX-Interpolation: A pipeline based on the modified CogVideoX structure, aimed at providing greater flexibility for keyframe interpolation generation.
DiffSynth-Studio: DiffSynth Studio is a diffusion engine. It has restructured the architecture, including text encoders, UNet, VAE, etc., enhancing computational performance while maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
CogVideoX-Controlnet: A simple ControlNet module code that includes the CogVideoX model.
VideoTuna: VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, text-to-image generation.
ConsisID: An identity-preserving text-to-video generation model, bases on CogVideoX-5B, which keep the face consistent in the generated video by frequency decomposition.
A Step by Step Tutorial: A step-by-step guide on installing and optimizing the CogVideoX1.5-5B-I2V model in Windows and cloud environments. Special thanks to the FurkanGozukara for his effort and support!

Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the CogVideoX open-source model.

Quick Start with Colab

Here provide three projects that can be run directly on free Colab T4 instances:

CogVideoX-5B-T2V-Colab.ipynb: CogVideoX-5B Text-to-Video Colab code.
CogVideoX-5B-T2V-Int8-Colab.ipynb: CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
CogVideoX-5B-I2V-Colab.ipynb: CogVideoX-5B Image-to-Video Colab code.
CogVideoX-5B-V2V-Colab.ipynb: CogVideoX-5B Video-to-Video Colab code.

Inference

dcli_demo: A more detailed inference code explanation, including the significance of common parameters. All of this is covered here.
cli_demo_quantization: Quantized model inference code that can run on devices with lower memory. You can also modify this code to support running CogVideoX models in FP8 precision.
diffusers_vae_demo: Code for running VAE inference separately.
space demo: The same GUI code as used in the Huggingface Space, with frame interpolation and super-resolution tools integrated.

convert_demo: How to convert user input into long-form input suitable for CogVideoX. Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data using an LLM. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language model.
gradio_web_demo: A simple Gradio web application demonstrating how to use the CogVideoX-2B / 5B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web application for video generation.

finetune

finetune_demo: Fine-tuning scheme and details of the diffusers version of the CogVideoX model.

sat

sat_demo: Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.

Tools

This folder contains some tools for model conversion / caption generation, etc.

convert_weight_sat2hf: Converts SAT model weights to Huggingface model weights.
caption_demo: Caption tool, a model that understands videos and outputs descriptions in text.
export_sat_lora_weight: SAT fine-tuning model export tool, exports the SAT Lora Adapter in diffusers format.
load_cogvideox_lora: Tool code for loading the diffusers version of fine-tuned Lora Adapter.
llm_flux_cogvideox: Automatically generate videos using an open-source local large language model + Flux + CogVideoX.
parallel_inference_xdit: Supported by xDiT, parallelize the video generation process on multiple GPUs.

CogVideo(ICLR'23)

The official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers is on the CogVideo branch

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

The demo for CogVideo is at https://models.aminer.cn/cogvideo, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.

Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

We welcome your contributions! You can click here for more information.

Model-License

The code in this repository is released under the Apache 2.0 License.

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the Apache 2.0 License.

The CogVideoX-5B model (Transformers module, include I2V and T2V) is released under the CogVideoX LICENSE.

For Tasks:

Click tags to check more tools for each tasks

generate videos from text convert weights between models caption videos fine-tune video generation models demo video generation

For Jobs:

data scientist machine learning engineer video content creator ai researcher software developer

Alternative AI tools for CogVideo

Similar Open Source Tools

CogVideo

github

: 11.0k

postgresml

PostgresML is a powerful Postgres extension that seamlessly combines data storage and machine learning inference within your database. It enables running machine learning and AI operations directly within PostgreSQL, leveraging GPU acceleration for faster computations, integrating state-of-the-art large language models, providing built-in functions for text processing, enabling efficient similarity search, offering diverse ML algorithms, ensuring high performance, scalability, and security, supporting a wide range of NLP tasks, and seamlessly integrating with existing PostgreSQL tools and client libraries.

github

: 6.1k

RetouchGPT

RetouchGPT is a novel framework designed for interactive face retouching using Large Language Models (LLMs). It leverages instruction-driven imperfection prediction and LLM-based embedding to guide the retouching process. The tool allows users to interactively modify imperfection features in face images, achieving high-fidelity retouching results. RetouchGPT outperforms existing methods by integrating textual and visual features to accurately identify imperfections and replace them with normal skin features.

github

: 87

Controllable-RAG-Agent

This repository contains a sophisticated deterministic graph-based solution for answering complex questions using a controllable autonomous agent. The solution is designed to ensure that answers are solely based on the provided data, avoiding hallucinations. It involves various steps such as PDF loading, text preprocessing, summarization, database creation, encoding, and utilizing large language models. The algorithm follows a detailed workflow involving planning, retrieval, answering, replanning, content distillation, and performance evaluation. Heuristics and techniques implemented focus on content encoding, anonymizing questions, task breakdown, content distillation, chain of thought answering, verification, and model performance evaluation.

github

: 951

eole

EOLE is an open language modeling toolkit based on PyTorch. It aims to provide a research-friendly approach with a comprehensive yet compact and modular codebase for experimenting with various types of language models. The toolkit includes features such as versatile training and inference, dynamic data transforms, comprehensive large language model support, advanced quantization, efficient finetuning, flexible inference, and tensor parallelism. EOLE is a work in progress with ongoing enhancements in configuration management, command line entry points, reproducible recipes, core API simplification, and plans for further simplification, refactoring, inference server development, additional recipes, documentation enhancement, test coverage improvement, logging enhancements, and broader model support.

github

: 106

LLM-Zero-to-Hundred

LLM-Zero-to-Hundred is a repository showcasing various applications of LLM chatbots and providing insights into training and fine-tuning Language Models. It includes projects like WebGPT, RAG-GPT, WebRAGQuery, LLM Full Finetuning, RAG-Master LLamaindex vs Langchain, open-source-RAG-GEMMA, and HUMAIN: Advanced Multimodal, Multitask Chatbot. The projects cover features like ChatGPT-like interaction, RAG capabilities, image generation and understanding, DuckDuckGo integration, summarization, text and voice interaction, and memory access. Tutorials include LLM Function Calling and Visualizing Text Vectorization. The projects have a general structure with folders for README, HELPER, .env, configs, data, src, images, and utils.

github

: 180

arbigent

Arbigent (Arbiter-Agent) is an AI agent testing framework designed to make AI agent testing practical for modern applications. It addresses challenges faced by traditional UI testing frameworks and AI agents by breaking down complex tasks into smaller, dependent scenarios. The framework is customizable for various AI providers, operating systems, and form factors, empowering users with extensive customization capabilities. Arbigent offers an intuitive UI for scenario creation and a powerful code interface for seamless test execution. It supports multiple form factors, optimizes UI for AI interaction, and is cost-effective by utilizing models like GPT-4o mini. With a flexible code interface and open-source nature, Arbigent aims to revolutionize AI agent testing in modern applications.

github

: 221

qdrant

Qdrant is a vector similarity search engine and vector database. It is written in Rust, which makes it fast and reliable even under high load. Qdrant can be used for a variety of applications, including: * Semantic search * Image search * Product recommendations * Chatbots * Anomaly detection Qdrant offers a variety of features, including: * Payload storage and filtering * Hybrid search with sparse vectors * Vector quantization and on-disk storage * Distributed deployment * Highlighted features such as query planning, payload indexes, SIMD hardware acceleration, async I/O, and write-ahead logging Qdrant is available as a fully managed cloud service or as an open-source software that can be deployed on-premises.

github

: 22.9k

Mooncake

Mooncake is a serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates prefill and decoding clusters, leveraging underutilized CPU, DRAM, and SSD resources of the GPU cluster. Mooncake's scheduler balances throughput and latency-related SLOs, with a prediction-based early rejection policy for highly overloaded scenarios. It excels in long-context scenarios, achieving up to a 525% increase in throughput while handling 75% more requests under real workloads.

github

: 3.0k

XLearning

XLearning is a scheduling platform for big data and artificial intelligence, supporting various machine learning and deep learning frameworks. It runs on Hadoop Yarn and integrates frameworks like TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost. XLearning offers scalability, compatibility, multiple deep learning framework support, unified data management based on HDFS, visualization display, and compatibility with code at native frameworks. It provides functions for data input/output strategies, container management, TensorBoard service, and resource usage metrics display. XLearning requires JDK >= 1.7 and Maven >= 3.3 for compilation, and deployment on CentOS 7.2 with Java >= 1.7 and Hadoop 2.6, 2.7, 2.8.

github

: 1.7k

Neurite

Neurite is an innovative project that combines chaos theory and graph theory to create a digital interface that explores hidden patterns and connections for creative thinking. It offers a unique workspace blending fractals with mind mapping techniques, allowing users to navigate the Mandelbrot set in real-time. Nodes in Neurite represent various content types like text, images, videos, code, and AI agents, enabling users to create personalized microcosms of thoughts and inspirations. The tool supports synchronized knowledge management through bi-directional synchronization between mind-mapping and text-based hyperlinking. Neurite also features FractalGPT for modular conversation with AI, local AI capabilities for multi-agent chat networks, and a Neural API for executing code and sequencing animations. The project is actively developed with plans for deeper fractal zoom, advanced control over node placement, and experimental features.

github

: 891

LinguaHaru

Next-generation AI translation tool that provides high-quality, precise translations for various common file formats with a single click. It is based on cutting-edge large language models, offering exceptional translation quality with minimal operation, supporting multiple document formats and languages. Features include multi-format compatibility, global language translation, one-click rapid translation, flexible translation engines, and LAN sharing for efficient collaborative work.

github

: 93

tinyllm

tinyllm is a lightweight framework designed for developing, debugging, and monitoring LLM and Agent powered applications at scale. It aims to simplify code while enabling users to create complex agents or LLM workflows in production. The core classes, Function and FunctionStream, standardize and control LLM, ToolStore, and relevant calls for scalable production use. It offers structured handling of function execution, including input/output validation, error handling, evaluation, and more, all while maintaining code readability. Users can create chains with prompts, LLM models, and evaluators in a single file without the need for extensive class definitions or spaghetti code. Additionally, tinyllm integrates with various libraries like Langfuse and provides tools for prompt engineering, observability, logging, and finite state machine design.

github

: 90

aistore

AIStore is a lightweight object storage system designed for AI applications. It is highly scalable, reliable, and easy to use. AIStore can be deployed on any commodity hardware, and it can be used to store and manage large datasets for deep learning and other AI applications.

github

: 1.5k

ScholarCopilot

Scholar Copilot is an intelligent academic writing assistant that enhances the research writing process through AI-powered text completion and citation suggestions. It aims to streamline academic writing while maintaining high scholarly standards. The tool provides features such as smart text generation with next-3-sentence suggestions, full section auto-completion, and context-aware writing. It also offers intelligent citation management with real-time citation suggestions, one-click citation insertion, and citation Bibtex generation. Scholar Copilot employs a unified model architecture that integrates retrieval and generation through a dynamic switching mechanism, ensuring coherent text generation with appropriate citation points.

github

: 86

YuE

YuE (乐) is an open-source foundation model designed for music generation, specifically transforming lyrics into full songs. It can generate complete songs in various genres and vocal styles, ensuring a polished and cohesive result. The model requires significant GPU memory for generating long sequences and recommends specific configurations for optimal performance. Users can customize the number of sessions for memory usage. The tool provides a quickstart guide for generating music using Transformers and includes tips for execution time and tag selection. The project is licensed under Creative Commons Attribution Non Commercial 4.0.

github

: 1.1k

For similar tasks

CogVideo

github

: 11.0k

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675

CogVideo

README:

CogVideo & CogVideoX

Project Updates

Table of Contents

Quick Start

Prompt Optimization

SAT

Diffusers

Gallery

CogVideoX-5B

CogVideoX-2B

Model Introduction

Friendly Links

Project Structure

Quick Start with Colab

Inference

finetune

sat

Tools

CogVideo(ICLR'23)

Citation

Model-License

For Tasks:

For Jobs:

Alternative AI tools for CogVideo

Similar Open Source Tools

CogVideo

postgresml

RetouchGPT

Controllable-RAG-Agent

eole

LLM-Zero-to-Hundred

arbigent

qdrant

Mooncake

XLearning

Neurite

LinguaHaru

tinyllm

aistore

ScholarCopilot

YuE

For similar tasks

CogVideo

For similar jobs

weave

LLMStack

VisionCraft

kaito

PyRIT

tabby

spear

Magick