Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Stars: 1324

Visit

README:

Oryx Video-ChatGPT 🎥 💬

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [ACL 2024 🔥]

Muhammad Maaz* , Hanoona Rasheed* , Salman Khan and Fahad Khan

* Equally contributing first authors

Mohamed bin Zayed University of Artificial Intelligence

Diverse Video-based Generative Performance Benchmarking (VCGBench-Diverse)

Video-based Generative Performance Benchmarking

Zeroshot Question-Answer Evaluation

Demo	Paper	Demo Clips	Offline Demo	Training	Video Instruction Data	Quantitative Evaluation	Qualitative Analysis
			Offline Demo	Training	Video Instruction Dataset	Quantitative Evaluation	Qualitative Analysis

📢 Latest Updates

Mar-28-25: Mobile-VideoGPT is released. It achieves excellent results on multiple benchmarks with 2x higher throughput. Check it out Mobile-VideoGPT 🔥🔥

Jun-14-24: VideoGPT+ is released. It achieves SoTA results on multiple benchmarks. Check it out at VideoGPT+ 🔥🔥
Jun-14-24: Semi-automatic video annotation pipeline is released. Check it out at GitHub, HuggingFace. 🔥🔥
Jun-14-24: VCGBench-Diverse Benchmarks are released. It provides 4,354 human annotated QA pairs across 18 video categories to extensively evaluate the performance of a video-conversation model. Check it out at GitHub, HuggingFace. 🔥🔥

May-16-24: Video-ChatGPT is accepted at ACL 2024! 🎊🎊
Sep-30-23: Our VideoInstruct100K dataset can be downloaded from HuggingFace/VideoInstruct100K. 🔥🔥
Jul-15-23: Our quantitative evaluation benchmark for Video-based Conversational Models now has its own dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT. 🔥🔥
Jun-28-23: Updated GitHub readme featuring benchmark comparisons of Video-ChatGPT against recent models - Video Chat, Video LLaMA, and LLaMA Adapter. Amid these advanced conversational models, Video-ChatGPT continues to deliver state-of-the-art performance.:fire::fire:
Jun-08-23 : Released the training code, offline demo, instructional data and technical report. All the resources including models, datasets and extracted features are available here. 🔥🔥
May-21-23 : Video-ChatGPT: demo released.

Online Demo 💻

🔥🔥 You can try our demo using the provided examples or by uploading your own videos HERE. 🔥🔥

🔥🔥 Or click the image to try the demo! 🔥🔥 You can access all the videos we demonstrate on here.

Video-ChatGPT Overview 💡

Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.

Contributions 🏆

We introduce 100K high-quality video-instruction pairs together with a novel annotation framework that is scalable and generates a diverse range of video-specific instruction sets of high-quality.
We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models.
Unique multimodal (vision-language) capability combining video understanding and language generation that is comprehensively evaluated using quantitative and qualitiative comparisons on video reasoning, creativitiy, spatial and temporal understanding, and action recognition tasks.

Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=video_chatgpt python=3.10
conda activate video_chatgpt

git clone https://github.com/mbzuai-oryx/Video-ChatGPT.git
cd Video-ChatGPT
pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training,

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
git checkout v1.0.7
python setup.py install

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.

Training 🚋

For training instructions, check out train_video_chatgpt.md.

Video Instruction Dataset 📂

We are releasing our 100,000 high-quality video instruction dataset that was used for training our Video-ChatGPT model. You can download the dataset from here. More details on our human-assisted and semi-automatic annotation framework for generating the data are available at VideoInstructionDataset.md.

Quantitative Evaluation 📊

Our paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. To explore our benchmarks and understand the framework in greater detail, please visit our dedicated website: https://mbzuai-oryx.github.io/Video-ChatGPT.

For detailed instructions on performing quantitative evaluation, please refer to QuantitativeEvaluation.md.

Video-based Generative Performance Benchmarking and Zero-Shot Question-Answer Evaluation tables are provided for a detailed performance overview.

Zero-Shot Question-Answer Evaluation

Model	MSVD-QA		MSRVTT-QA		TGIF-QA		Activity Net-QA
	Accuracy	Score	Accuracy	Score	Accuracy	Score	Accuracy	Score
FrozenBiLM	32.2	--	16.8	--	41.0	--	24.7	--
Video Chat	56.3	2.8	45.0	2.5	34.4	2.3	26.5	2.2
LLaMA Adapter	54.9	3.1	43.8	2.7	-	-	34.2	2.7
Video LLaMA	51.6	2.5	29.6	1.8	-	-	12.4	1.1
Video-ChatGPT	64.9	3.3	49.3	2.8	51.4	3.0	35.2	2.7

Video-based Generative Performance Benchmarking

Evaluation Aspect	Video Chat	LLaMA Adapter	Video LLaMA	Video-ChatGPT
Correctness of Information	2.23	2.03	1.96	2.40
Detail Orientation	2.50	2.32	2.18	2.52
Contextual Understanding	2.53	2.30	2.16	2.62
Temporal Understanding	1.94	1.98	1.82	1.98
Consistency	2.24	2.15	1.79	2.37

Qualitative Analysis 🔍

A Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.

Video Reasoning Tasks 🎥

Creative and Generative Tasks 🖌️

Spatial Understanding 🌐

Video Understanding and Conversational Tasks 💬

Action Recognition 🏃

Question Answering Tasks ❓

Temporal Understanding ⏳

Acknowledgements 🙏

LLaMA: A great attempt towards open and efficient LLMs!
Vicuna: Has the amazing language capabilities!
LLaVA: our architecture is inspired from LLaVA.
Thanks to our colleagues at MBZUAI for their essential contribution to the video annotation task, including Salman Khan, Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, Vishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, Jameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, Bokang Jia, without which this project would not be possible.

If you're using Video-ChatGPT in your research or applications, please cite using this BibTeX:

@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}

License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for Video-ChatGPT

Similar Open Source Tools

Video-ChatGPT

github

: 1.3k

HaE

HaE is a framework project in the field of network security (data security) that combines artificial intelligence (AI) large models to achieve highlighting and information extraction of HTTP messages (including WebSocket). It aims to reduce testing time, focus on valuable and meaningful messages, and improve vulnerability discovery efficiency. The project provides a clear and visual interface design, simple interface interaction, and centralized data panel for querying and extracting information. It also features built-in color upgrade algorithm, one-click export/import of data, and integration of AI large models API for optimized data processing.

github

: 2.7k

Step-DPO

Step-DPO is a method for enhancing long-chain reasoning ability of LLMs with a data construction pipeline creating a high-quality dataset. It significantly improves performance on math and GSM8K tasks with minimal data and training steps. The tool fine-tunes pre-trained models like Qwen2-7B-Instruct with Step-DPO, achieving superior results compared to other models. It provides scripts for training, evaluation, and deployment, along with examples and acknowledgements.

github

: 155

sktime

sktime is a Python library for time series analysis that provides a unified interface for various time series learning tasks such as classification, regression, clustering, annotation, and forecasting. It offers time series algorithms and tools compatible with scikit-learn for building, tuning, and validating time series models. sktime aims to enhance the interoperability and usability of the time series analysis ecosystem by empowering users to apply algorithms across different tasks and providing interfaces to related libraries like scikit-learn, statsmodels, tsfresh, PyOD, and fbprophet.

github

: 8.3k

Awesome-LLM-RAG-Application

Awesome-LLM-RAG-Application is a repository that provides resources and information about applications based on Large Language Models (LLM) with Retrieval-Augmented Generation (RAG) pattern. It includes a survey paper, GitHub repo, and guides on advanced RAG techniques. The repository covers various aspects of RAG, including academic papers, evaluation benchmarks, downstream tasks, tools, and technologies. It also explores different frameworks, preprocessing tools, routing mechanisms, evaluation frameworks, embeddings, security guardrails, prompting tools, SQL enhancements, LLM deployment, observability tools, and more. The repository aims to offer comprehensive knowledge on RAG for readers interested in exploring and implementing LLM-based systems and products.

github

: 687

langtrace

Langtrace is an open source observability software that lets you capture, debug, and analyze traces and metrics from all your applications that leverage LLM APIs, Vector Databases, and LLM-based Frameworks. It supports Open Telemetry Standards (OTEL), and the traces generated adhere to these standards. Langtrace offers both a managed SaaS version (Langtrace Cloud) and a self-hosted option. The SDKs for both Typescript/Javascript and Python are available, making it easy to integrate Langtrace into your applications. Langtrace automatically captures traces from various vendors, including OpenAI, Anthropic, Azure OpenAI, Langchain, LlamaIndex, Pinecone, and ChromaDB.

github

: 856

Xwin-LM

Xwin-LM is a powerful and stable open-source tool for aligning large language models, offering various alignment technologies like supervised fine-tuning, reward models, reject sampling, and reinforcement learning from human feedback. It has achieved top rankings in benchmarks like AlpacaEval and surpassed GPT-4. The tool is continuously updated with new models and features.

github

: 982

Nocode-Wep

Nocode/WEP is a forward-looking office visualization platform that includes modules for document building, web application creation, presentation design, and AI capabilities for office scenarios. It supports features such as configuring bullet comments, global article comments, multimedia content, custom drawing boards, flowchart editor, form designer, keyword annotations, article statistics, custom appreciation settings, JSON import/export, content block copying, and unlimited hierarchical directories. The platform is compatible with major browsers and aims to deliver content value, iterate products, share technology, and promote open-source collaboration.

github

: 143

skpro

skpro is a library for supervised probabilistic prediction in python. It provides `scikit-learn`-like, `scikit-base` compatible interfaces to: * tabular **supervised regressors for probabilistic prediction** \- interval, quantile and distribution predictions * tabular **probabilistic time-to-event and survival prediction** \- instance-individual survival distributions * **metrics to evaluate probabilistic predictions** , e.g., pinball loss, empirical coverage, CRPS, survival losses * **reductions** to turn `scikit-learn` regressors into probabilistic `skpro` regressors, such as bootstrap or conformal * building **pipelines and composite models** , including tuning via probabilistic performance metrics * symbolic **probability distributions** with value domain of `pandas.DataFrame`-s and `pandas`-like interface

github

: 259

Code-Review-GPT-Gitlab

A project that utilizes large models to help with Code Review on Gitlab, aimed at improving development efficiency. The project is customized for Gitlab and is developing a Multi-Agent plugin for collaborative review. It integrates various large models for code security issues and stays updated with the latest Code Review trends. The project architecture is designed to be powerful, flexible, and efficient, with easy integration of different models and high customization for developers.

github

: 452

ChuanhuChatGPT

Chuanhu Chat is a user-friendly web graphical interface that provides various additional features for ChatGPT and other language models. It supports GPT-4, file-based question answering, local deployment of language models, online search, agent assistant, and fine-tuning. The tool offers a range of functionalities including auto-solving questions, online searching with network support, knowledge base for quick reading, local deployment of language models, GPT 3.5 fine-tuning, and custom model integration. It also features system prompts for effective role-playing, basic conversation capabilities with options to regenerate or delete dialogues, conversation history management with auto-saving and search functionalities, and a visually appealing user experience with themes, dark mode, LaTeX rendering, and PWA application support.

github

: 15.2k

vlmrun-cookbook

VLM Run Cookbook is a repository containing practical examples and tutorials for extracting structured data from images, videos, and documents using Vision Language Models (VLMs). It offers comprehensive Colab notebooks demonstrating real-world applications of VLM Run, with complete code and documentation for easy adaptation. The examples cover various domains such as financial documents and TV news analysis.

github

: 259

AI-LLM-ML-CS-Quant-Review

This repository provides an in-depth review of industry trends in AI, Large Language Models (LLMs), Machine Learning, Computer Science, and Quantitative Finance. It covers various topics such as NVIDIA GTC conferences, DeepSeek theory and applications, LangGraph & Cursor AI, LLM essentials, system design, computer systems, big data and AI in finance, C++ design patterns, high-frequency finance, machine learning for algorithmic trading, stochastic volatility modeling, and quant job interview questions.

github

: 56

speechless

Speechless.AI is committed to integrating the superior language processing and deep reasoning capabilities of large language models into practical business applications. By enhancing the model's language understanding, knowledge accumulation, and text creation abilities, and introducing long-term memory, external tool integration, and local deployment, our aim is to establish an intelligent collaborative partner that can independently interact, continuously evolve, and closely align with various business scenarios.

github

: 100

agentica

Agentica is a human-centric framework for building large language model agents. It provides functionalities for planning, memory management, tool usage, and supports features like reflection, planning and execution, RAG, multi-agent, multi-role, and workflow. The tool allows users to quickly code and orchestrate agents, customize prompts, and make API calls to various services. It supports API calls to OpenAI, Azure, Deepseek, Moonshot, Claude, Ollama, and Together. Agentica aims to simplify the process of building AI agents by providing a user-friendly interface and a range of functionalities for agent development.

github

: 108

AI-LLM-ML-CS-Quant-Overview

AI-LLM-ML-CS-Quant-Overview is a repository providing overview notes on AI, Large Language Models (LLM), Machine Learning (ML), Computer Science (CS), and Quantitative Finance. It covers various topics such as LangGraph & Cursor AI, DeepSeek, MoE (Mixture of Experts), NVIDIA GTC, LLM Essentials, System Design, Computer Systems, Big Data and AI in Finance, Econometrics and Statistics Conference, C++ Design Patterns and Derivatives Pricing, High-Frequency Finance, Machine Learning for Algorithmic Trading, Stochastic Volatility Modeling, Quant Job Interview Questions, Distributed Systems, Language Models, Designing Machine Learning Systems, Designing Data-Intensive Applications (DDIA), Distributed Machine Learning, and The Elements of Quantitative Investing.

github

: 52

For similar tasks

No tools available

For similar jobs

No tools available