Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Latest Papers, Code, and Datasets on Vid-LLMs.

Stars: 1813


Awesome-LLMs-for-Video-Understanding is a repository dedicated to exploring Video Understanding with Large Language Models. It provides a comprehensive survey of the field, covering models, pretraining, instruction tuning, and hybrid methods. The repository also includes information on tasks, datasets, and benchmarks related to video understanding. Contributors are encouraged to add new papers, projects, and materials to enhance the repository.

README:

Awesome-LLMs-for-Video-Understanding

🔥🔥🔥 Video Understanding with Large Language Models: A Survey

Yunlong Tang¹,*, Jing Bi¹,*, Siting Xu²,*, Luchuan Song¹, Susan Liang¹, Teng Wang²,³, Daoan Zhang¹, Jie An¹, Jingyang Lin¹, Rongyi Zhu¹, Ali Vosoughi¹, Chao Huang¹, Zeliang Zhang¹, Pinxin Liu¹, Mingqian Feng¹, Feng Zheng², Jianguo Zhang², Ping Luo³, Jiebo Luo¹, Chenliang Xu¹,†. (*Core Contributors, †Corresponding Authors)

¹University of Rochester, ²Southern University of Science and Technology, ³The University of Hong Kong

Paper | Project Page

📢 News

[07/23/2024]

📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:
✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ All figures and tables have been redesigned.

Several minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

Table of Contents

Awesome-LLMs-for-Video-Understanding

Why do we need Vid-LLMs?

😎 Vid-LLMs: Models

📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🗒️ Taxonomy 1

🕹️ Video Analyzer × LLM

LLM as Summarizer

Title	Model	Date	Code	Venue
Seeing the Unseen: Visual Metaphor Captioning for Videos	GIT-LLaVA	06/2024	code	arXiv
Zero-Shot Long-Form Video Understanding through Screenplay	MM-Screenplayer	06/2024	project page	CVPR
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering	MoReVQA	04/2024	project page	CVPR
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	IG-VLM	03/2024	code	arXiv
Language Repository for Long Video Understanding	LangRepo	03/2024	code	arXiv
Understanding Long Videos in One Multimodal Language Model Pass	MVU	03/2024	code	arXiv
Video ReCap: Recursive Captioning of Hour-Long Videos	Video ReCap	02/2024	code	CVPR
A Simple LLM Framework for Long-Range Video Question-Answering	LLoVi	12/2023	code	arXiv
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos	Grounding-Prompter	12/2023	code	arXiv
Learning Object State Changes in Videos: An Open-World Perspective	VIDOSC	12/2023	code	CVPR
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?	AntGPT	07/2023	code	ICLR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	VAST	05/2023	code	NeurIPS
VLog: Video as a Long Document	VLog	04/2023	code	-
Learning Video Representations from Large Language Models	LaViLa	12/2022	code	CVPR

LLM as Manager

Title	Model	Date	Code	Venue
DrVideo: Document Retrieval Based Long Video Understanding	DrVideo	06/2024	code	arXiv
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer	OmAgent	06/2024	code	arXiv
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA	LVNet	06/2024	code	arXiv
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	VideoTree	05/2024	code	arXiv
Harnessing Large Language Models for Training-free Video Anomaly Detection	LAVAD	04/2024	code	CVPR
TraveLER: A Multi-LMM Agent Framework for Video Question-Answering	TraveLER	04/2024	code	arXiv
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features	GPTSee	03/2024	code	arXiv
Reframe Anything: LLM Agent for Open World Video Reframing	RAVA	03/2024	code	arXiv
SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos	SCHEMA	03/2024	code	ICLR
TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning	TV-TREES	02/2024	code	arXiv
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding	VideoAgent	03/2024	project page	arXiv
VideoAgent: Long-form Video Understanding with Large Language Model as Agent	VideoAgent	03/2024	code	arXiv
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding	VURF	03/2024	code	arXiv
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos	KEPP	03/2024	code	CVPR
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models	DoraemonGPT	01/2024	code	arXiv
LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos	LifelongMemory	12/2023	code	arXiv
Zero-Shot Video Question Answering with Procedural Programs	ProViQ	12/2023	code	arXiv
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn	AssistGPT	06/2023	code	arXiv
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	ChatVideo	04/2023	project page	arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions	Video ChatCaptioner	04/2023	code	arXiv
ViperGPT: Visual Inference via Python Execution for Reasoning	ViperGPT	03/2023	code	arXiv
Hawk: Learning to Understand Open-World Video Anomalies	Hawk	05/2024	code	arXiv

👾 Video Embedder × LLM

LLM as Text Decoder

Title	Model	Date	Code	Venue
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark	AuroraCap	10/2024	project page	arXiv
Artemis: Towards Referential Understanding in Complex Videos	Artemis	06/2024	code	arXiv
EmoLLM: Multimodal Emotional Understanding Meets Large Language Models	EmoLLM	06/2024	code	arXiv
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models	FTFV-LLM	06/2024	-	arXiv
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	Flash-VStream	06/2024	code	arXiv
LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living	LLAVIDAL	06/2024	code	arXiv
Long Context Transfer from Language to Vision	LongVA	06/2024	code	arXiv
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions	ShareGPT4Video	06/2024	code	arXiv
Towards Event-oriented Long Video Understanding	VIM	06/2024	code	arXiv
Video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models	Video-SALMONN	06/2024	code	ICML
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	VideoGPT+	06/2024	code	arXiv
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	VideoLLaMA 2	06/2024	code	arXiv
MotionLLM: Understanding Human Behaviors from Human Motions and Videos	MotionLLM	05/2024	project page	arXiv
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	VideoChat2	11/2023	code	CVPR
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization	Shotluck Holmes	05/2024	-	arXiv
Streaming Long Video Understanding with Large Language Models	VideoStreaming	05/2024	-	arXiv
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline	VideoNarrator	05/2024	-	arXiv
TOPA: Extend Large Language Models for Video Understanding via Text-Only Pre-Alignment	TOPA	05/2024	code	NeurIPS
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering	MovieChat+	04/2024	code	arXiv
AutoAD III: The Prequel – Back to the Pixels	AutoAD III	04/2024	project page	CVPR
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward	LLaVA-Hound-DPO	04/2024	code	arXiv
From Image to Video, What Do We Need in Multimodal LLMs?	RED-VILLM	04/2024	-	arXiv
Koala: Key Frame-Conditioned Long Video-LLM	Koala	04/2024	project page	CVPR
LongVLM: Efficient Long Video Understanding via Large Language Models	LongVLM	04/2024	code	ECCV
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	MA-LMM	04/2024	code	CVPR
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	MiniGPT4-Video	04/2024	code	arXiv
Pegasus-v1 Technical Report	Pegasus-v1	04/2024	code	arXiv
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	PLLaVA	04/2024	code	arXiv
ST-LLM: Large Language Models Are Effective Temporal Learners	ST-LLM	04/2024	code	arXiv
Tarsier: Recipes for Training and Evaluating Large Video Description Models	Tarsier	07/2024	code	arXiv
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model	X-VARS	04/2024	code	arXiv
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios	CAT	03/2024	code	arXiv
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding	InternVideo2	03/2024	code	ECCV
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies	MovieLLM	03/2024	code	arXiv
LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs	IVAwithLLM	02/2024	code	arXiv
LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding	LSTP	02/2024	code	EMNLP
LVCHAT: Facilitating Long Video Comprehension	LVCHAT	02/2024	code	arXiv
OSCaR: Object State Captioning and State Change Representation	OSCaR	02/2024	code	NAACL
Slot-VLM: SlowFast Slots for Video-Language Modeling	Slot-VLM	02/2024	code	arXiv
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training	COSMO	01/2024	code	arXiv
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering	GCG	01/2024	code	ACMMM
Audio-Visual LLM for Video Understanding	AV-LLM	12/2023	code	arXiv
Generative Multimodal Models are In-Context Learners	Emu2	12/2023	project page	CVPR
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples	MMICT	12/2023	code	TOMM
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding	VaQuitA	12/2023	code	arXiv
VILA: On Pre-training for Visual Language Models	VILA	12/2023	code	CVPR
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens	Vista-LLaMA	12/2023	project page	arXiv
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	Chat-UniVi	11/2023	code	CVPR
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	LLaMA-VID	11/2023	code	arXiv
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	Video-LLaVA	11/2023	code	arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering	LLaMA-VQA	10/2023	code	EMNLP
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	MovieChat	07/2023	code	CVPR
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning	LLMVA-GEBC	06/2023	code	CVPR
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Macaw-LLM	06/2023	project page	arXiv
Valley: Video Assistant with Large Language model Enhanced abilitY	VALLEY	06/2023	code	arXiv
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Video-ChatGPT	06/2023	code	ACL
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	Video-LLaMA	06/2023	code	EMNLP
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	mPLUG-video	06/2023	code	arXiv
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst	ChatBridge	05/2023	code	arXiv
Otter: A Multi-Modal Model with In-Context Instruction Tuning	Otter	05/2023	code	arXiv
VideoLLM: Modeling Video Sequence with Large Language Models	VideoLLM	05/2023	code	arXiv

LLM as Regressor

Title	Model	Date	Code	Venue
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval	LLaVA-MR	11/2024	code	arXiv
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM	Holmes-VAD	06/2024	code	arXiv
VideoLLM-online: Online Video Large Language Model for Streaming Video	VideoLLM-online	06/2024	code	CVPR
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision	VLM4HOI	04/2024	project page	arXiv
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning	V2Xum-LLaMA	04/2024	code	arXiv
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue	AVicuna	03/2024	code	arXiv
Elysium: Exploring Object-level Perception in Videos via MLLM	Elysium	03/2024	code	arXiv
HawkEye: Training Video-Text LLMs for Grounding Text in Videos	HawkEye	03/2024	code	arXiv
LITA: Language Instructed Temporal-Localization Assistant	LITA	03/2024	code	arXiv
OmniViD: A Generative Framework for Universal Video Understanding	OmniViD	03/2024	code	CVPR
GroundingGPT: Language Enhanced Multi-modal Grounding Model	GroundingGPT	01/2024	[code](https://github.com/lzw-lzw/GroundingGPT)	arXiv
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	TimeChat	12/2023	code	CVPR
Self-Chained Image-Language Model for Video Localization and Question Answering	SeViLA	11/2023	code	NeurIPS
VTimeLLM: Empower LLM to Grasp Video Moments	VTimeLLM	11/2023	code	arXiv

LLM as Hidden Layer

Title	Model	Date	Code	Venue
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding	VTG-LLM	05/2024	code	arXiv
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	VITRON	04/2024	project page	NeurIPS
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT	VTG-GPT	03/2024	code	arXiv
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning	Momentor	02/2024	code	ICML
Detours for Navigating Instructional Videos	VidDetours	01/2024	code	CVPR
OneLLM: One Framework to Align All Modalities with Language	OneLLM	12/2023	code	arXiv
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation	GPT4Video	11/2023	code	ACMMM

🧭 (Analyzer + Embedder) × LLM

LLM as Manager

Title	Model	Date	Code	Venue
MM-VID: Advancing Video Understanding with GPT-4V(ision)	MM-VID	10/2023	-	arXiv

LLM as Summarizer

Title	Model	Date	Code	Venue
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos	SUM-shot	12/2023	code	arXiv

LLM as Regressor

Title	Model	Date	Code	Venue
Vript: A Video Is Worth Thousands of Words	Vriptor	06/2024	code	NeurIPS
Merlin: Empowering Multimodal LLMs with Foresight Minds	Merlin	12/2023	project page	ECCV
VideoChat: Chat-Centric Video Understanding	VideoChat	05/2023	code	arXiv
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	Vid2Seq	02/2023	code	CVPR

LLM as Text Decoder

Title	Model	Date	Code	Venue
Contextual AD Narration with Interleaved Multimodal Sequence	Uni-AD	03/2024	code	arXiv
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning	MM-narrator	11/2023	project page	arXiv
Vamos: Versatile Action Models for Video Understanding	Vamos	11/2023	project page	ECCV
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description	Auto-AD II	10/2023	project page	ICCV

LLM as Hidden Layer

Title	Model	Date	Code	Venue
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	PG-Video-LLaVA	11/2023	code	arXiv

🗒️ Taxonomy 2

🤖 LLM-based Video Agents

Title	Model	Date	Code	Venue
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language	Socratic Models	04/2022	project page	arXiv
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions	Video ChatCaptioner	04/2023	code	arXiv
VLog: Video as a Long Document	VLog	04/2023	code	-
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System	ChatVideo	04/2023	project page	arXiv
MM-VID: Advancing Video Understanding with GPT-4V(ision)	MM-VID	10/2023	-	arXiv
MISAR: A Multimodal Instructional System with Augmented Reality	MISAR	10/2023	project page	ICCV
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos	Grounding-Prompter	12/2023	-	arXiv
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation	NaVid	02/2024	project page	RSS
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding	VideoAgent	03/2024	project page	arXiv
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs	VideoINSTA	09/2024	code	EMNLP

🎥 Vid-LLM Pretraining

Title	Model	Date	Code	Venue
Learning Video Representations from Large Language Models	LaViLa	12/2022	code	CVPR
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning	Vid2Seq	02/2023	code	CVPR
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	VAST	05/2023	code	NeurIPS
Merlin: Empowering Multimodal LLMs with Foresight Minds	Merlin	12/2023	-	arXiv

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

Title	Model	Date	Code	Venue
Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding	Video-LLaMA	06/2023	code	arXiv
VALLEY: Video Assistant with Large Language model Enhanced abilitY	VALLEY	06/2023	code	-
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	Video-ChatGPT	06/2023	code	arXiv
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Macaw-LLM	06/2023	code	arXiv
LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning	LLMVA-GEBC	06/2023	code	CVPR
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	mPLUG-video	06/2023	code	arXiv
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	MovieChat	07/2023	code	arXiv
Large Language Models are Temporal and Causal Reasoners for Video Question Answering	LLaMA-VQA	10/2023	code	EMNLP
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	Video-LLaVA	11/2023	code	arXiv
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	Chat-UniVi	11/2023	code	arXiv
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	LLaMA-VID	11/2023	code	arXiv
VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens	VISTA-LLAMA	12/2023	-	arXiv
Audio-Visual LLM for Video Understanding	-	12/2023	-	arXiv
AutoAD: Movie Description in Context	AutoAD	06/2023	code	CVPR
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description	AutoAD II	10/2023	-	ICCV
AutoAD III: The Prequel -- Back to the Pixels	AutoAD III	04/2024	-	CVPR
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models	FAVOR	10/2023	code	arXiv
VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	VideoLLaMA2	06/2024	code	arXiv

Fine-tuning with Insertive Adapters

Title	Model	Date	Code	Venue
Otter: A Multi-Modal Model with In-Context Instruction Tuning	Otter	06/2023	code	arXiv
VideoLLM: Modeling Video Sequence with Large Language Models	VideoLLM	05/2023	code	arXiv

Fine-tuning with Hybrid Adapters

Title	Model	Date	Code	Venue
VTimeLLM: Empower LLM to Grasp Video Moments	VTimeLLM	11/2023	code	arXiv
GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation	GPT4Video	11/2023	-	arXiv

🦾 Hybrid Methods

Title	Model	Date	Code	Venue
VideoChat: Chat-Centric Video Understanding	VideoChat	05/2023	code demo	arXiv
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	PG-Video-LLaVA	11/2023	code	arXiv
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	TimeChat	12/2023	code	CVPR
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Video-GroundingDINO	12/2023	code	arXiv
A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot	Video4096	05/2023	-	EMNLP

💎 Training-free Methods

Title	Model	Date	Code	Venue
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	SlowFast-LLaVA	07/2024	-	arXiv
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	TS-LLaVA	11/2024	code	arXiv

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

Name	Paper	Date	Link	Venue
Charades	Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding	2016	Link	ECCV
YouTube8M	YouTube-8M: A Large-Scale Video Classification Benchmark	2016	Link	-
ActivityNet	ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding	2015	Link	CVPR
Kinetics-GEBC	GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval	2022	Link	ECCV
Kinetics-400	The Kinetics Human Action Video Dataset	2017	Link	-
VidChapters-7M	VidChapters-7M: Video Chapters at Scale	2023	Link	NeurIPS

Captioning and Description

Name	Paper	Date	Link	Venue
Microsoft Research Video Description Corpus (MSVD)	Collecting Highly Parallel Data for Paraphrase Evaluation	2011	Link	ACL
Microsoft Research Video-to-Text (MSR-VTT)	MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	2016	Link	CVPR
Tumblr GIF (TGIF)	TGIF: A New Dataset and Benchmark on Animated GIF Description	2016	Link	CVPR
Charades	Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding	2016	Link	ECCV
Charades-Ego	Actor and Observer: Joint Modeling of First and Third-Person Videos	2018	Link	CVPR
ActivityNet Captions	Dense-Captioning Events in Videos	2017	Link	ICCV
HowTo100M	HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips	2019	Link	ICCV
Movie Audio Descriptions (MAD)	MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions	2021	Link	CVPR
YouCook2	Towards Automatic Learning of Procedures from Web Instructional Videos	2017	Link	AAAI
MovieNet	MovieNet: A Holistic Dataset for Movie Understanding	2020	Link	ECCV
Youku-mPLUG	Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	2023	Link	arXiv
Video Timeline Tags (ViTT)	Multimodal Pretraining for Dense Video Captioning	2020	Link	AACL-IJCNLP
TVSum	TVSum: Summarizing web videos using titles	2015	Link	CVPR
SumMe	Creating Summaries from User Videos	2014	Link	ECCV
VideoXum	VideoXum: Cross-modal Visual and Textural Summarization of Videos	2023	Link	IEEE Trans Multimedia
Multi-Source Video Captioning (MSVC)	VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024	Link	arXiv

Grounding and Retrieval

Name	Paper	Date	Link	Venue
Epic-Kitchens-100	Rescaling Egocentric Vision	2021	Link	IJCV
VCR (Visual Commonsense Reasoning)	From Recognition to Cognition: Visual Commonsense Reasoning	2019	Link	CVPR
Ego4D-MQ and Ego4D-NLQ	Ego4D: Around the World in 3,000 Hours of Egocentric Video	2021	Link	CVPR
Vid-STG	Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences	2020	Link	CVPR
Charades-STA	TALL: Temporal Activity Localization via Language Query	2017	Link	ICCV
DiDeMo	Localizing Moments in Video with Natural Language	2017	Link	ICCV

Question Answering

Name	Paper	Date	Link	Venue
MSVD-QA	Video Question Answering via Gradually Refined Attention over Appearance and Motion	2017	Link	ACM Multimedia
MSRVTT-QA	Video Question Answering via Gradually Refined Attention over Appearance and Motion	2017	Link	ACM Multimedia
TGIF-QA	TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering	2017	Link	CVPR
ActivityNet-QA	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	2019	Link	AAAI
Pororo-QA	DeepStory: Video Story QA by Deep Embedded Memory Networks	2017	Link	IJCAI
TVQA	TVQA: Localized, Compositional Video Question Answering	2018	Link	EMNLP
MAD-QA	Encoding and Controlling Global Semantics for Long-form Video Question Answering	2024	Link	EMNLP
Ego-QA	Encoding and Controlling Global Semantics for Long-form Video Question Answering	2024	Link	EMNLP

Video Instruction Tuning

Pretraining Dataset

Name	Paper	Date	Link	Venue
VidChapters-7M	VidChapters-7M: Video Chapters at Scale	2023	Link	NeurIPS
VALOR-1M	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023	Link	arXiv
Youku-mPLUG	Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	2023	Link	arXiv
InternVid	InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	2023	Link	arXiv
VAST-27M	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023	Link	NeurIPS

Fine-tuning Dataset

Name	Paper	Date	Link	Venue
MIMIC-IT	MIMIC-IT: Multi-Modal In-Context Instruction Tuning	2023	Link	arXiv
VideoInstruct100K	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023	Link	arXiv
TimeIT	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	2023	Link	CVPR

Video-based Large Language Models Benchmark

Title	Date	Code	Venue
LVBench: An Extreme Long Video Understanding Benchmark	06/2024	code	-
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models	11/2023	code	-
Perception Test: A Diagnostic Benchmark for Multimodal Video Models	05/2023	code	NeurIPS 2023, ICCV 2023 Workshop
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks	07/2023	code	-
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation	11/2023	code	NeurIPS 2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding	12/2023	code	-
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	12/2023	code	-
TempCompass: Do Video LLMs Really Understand Videos?	03/2024	code	ACL 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis	06/2024	code	-
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models	06/2024	code	-

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors that you may find. Please make sure that your pull requests follow the "Title|Model|Date|Code|Venue" format. Thank you for your valuable contributions!
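As a rough illustration of the row format mentioned above, the sketch below builds one table row in "Title|Model|Date|Code|Venue" order. It assumes the tables are Markdown pipe tables, that dates follow the MM/YYYY pattern used in the model tables, and that a missing cell is written as "-". The helper name `format_entry` and the checks it performs are hypothetical and not part of this repository.

```python
import re

# Column order taken from the contributing note: Title|Model|Date|Code|Venue.
COLUMNS = ("Title", "Model", "Date", "Code", "Venue")
HEADER = "| " + " | ".join(COLUMNS) + " |"

# Assumption: dates use MM/YYYY, as in the model tables of this list.
DATE_PATTERN = re.compile(r"^(0[1-9]|1[0-2])/\d{4}$")


def format_entry(title: str, model: str, date: str, code: str = "", venue: str = "") -> str:
    """Return one pipe-table row; hypothetical helper, not a repository tool."""
    if not DATE_PATTERN.match(date):
        raise ValueError(f"date must look like MM/YYYY, got {date!r}")
    fields = (title, model, date, code, venue)
    # Empty cells become "-"; literal pipes are escaped so they do not split a cell.
    cells = [(field.strip() or "-").replace("|", "\\|") for field in fields]
    return "| " + " | ".join(cells) + " |"


row = format_entry(
    "VideoChat: Chat-Centric Video Understanding",
    "VideoChat",
    "05/2023",
    "code",
    "arXiv",
)
print(HEADER)
print(row)
```

In a real pull request the Code cell would normally hold a link to the repository or project page rather than the bare word "code".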


♥️ Contributors

Our project wouldn't be possible without the contributions of these amazing people! Thank you all for making this project better.

Yunlong Tang @ University of Rochester
Jing Bi @ University of Rochester
Siting Xu @ Southern University of Science and Technology
Luchuan Song @ University of Rochester
Susan Liang @ University of Rochester
Teng Wang @ The University of Hong Kong
Daoan Zhang @ University of Rochester
Jie An @ University of Rochester
Jingyang Lin @ University of Rochester
Rongyi Zhu @ University of Rochester
Ali Vosoughi @ University of Rochester
Chao Huang @ University of Rochester
Zeliang Zhang @ University of Rochester
Pinxin Liu @ University of Rochester
Mingqian Feng @ University of Rochester
Feng Zheng @ Southern University of Science and Technology
Jianguo Zhang @ Southern University of Science and Technology
Ping Luo @ The University of Hong Kong
Jiebo Luo @ University of Rochester
Chenliang Xu @ University of Rochester

For Tasks:


analyze videos, generate video captions, answer video-related questions, train large language models, improve video understanding

For Jobs:

data scientist, machine learning engineer, research scientist, AI engineer, video content creator

Alternative AI tools for Awesome-LLMs-for-Video-Understanding

Similar Open Source Tools

Awesome-LLMs-for-Video-Understanding

GitHub stars: 1.8k

llms-from-scratch-cn

This repository provides a detailed tutorial on how to build your own large language model (LLM) from scratch. It includes all the code necessary to create a GPT-like LLM, covering the encoding, pre-training, and fine-tuning processes. The tutorial is written in a clear and concise style, with plenty of examples and illustrations to help you understand the concepts involved. It is suitable for developers and researchers with some programming experience who are interested in learning more about LLMs and how to build them.

GitHub stars: 860

Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on Large Language Models (LLMs). The repository contains papers, code, datasets, evaluations, and analyses related to jailbreak attacks on LLMs. It serves as a comprehensive resource for researchers and practitioners interested in exploring various jailbreak techniques and defenses in the context of LLMs. Contributions such as additional jailbreak-related content, pull requests, and issue reports are welcome, and contributors are acknowledged. For any inquiries or issues, contact the maintainers. If you find this repository useful for your research or work, consider starring it to show appreciation.

GitHub stars: 507

Awesome-Knowledge-Distillation-of-LLMs

A collection of papers related to knowledge distillation of large language models (LLMs). The repository focuses on techniques to transfer advanced capabilities from proprietary LLMs to smaller models, compress open-source LLMs, and refine their performance. It covers various aspects of knowledge distillation, including algorithms, skill distillation, verticalization distillation in fields like law, medical & healthcare, finance, science, and miscellaneous domains. The repository provides a comprehensive overview of the research in the area of knowledge distillation of LLMs.

GitHub stars: 890

VoiceBench

VoiceBench is a repository containing code and data for benchmarking LLM-Based Voice Assistants. It includes a leaderboard with rankings of various voice assistant models based on different evaluation metrics. The repository provides setup instructions, datasets, evaluation procedures, and a curated list of awesome voice assistants. Users can submit new voice assistant results through the issue tracker for updates on the ranking list.

GitHub stars: 119

MindChat

MindChat is a psychological large language model designed to help individuals relieve psychological stress and resolve mental confusion, ultimately improving mental health. It aims to provide a relaxed and open conversation environment in which users can build trust and understanding. MindChat offers private, warm, safe, timely, and convenient conversation settings to help users overcome difficulties and challenges and achieve self-growth and development. The tool is suitable for both work and personal life scenarios, providing comprehensive psychological support and therapeutic assistance to users while strictly protecting user privacy. It combines psychological knowledge with artificial intelligence technology to contribute to a healthier, more inclusive, and equal society.

github

: 436

stylellm_models

**stylellm** is a text style transfer project based on large language models (LLMs). The project uses LLMs to learn the writing style of specific literary works (commonly used vocabulary, sentence structure, rhetoric, character dialogue, etc.), producing a series of style-specific models. With a **stylellm** model, the learned style can be transferred to other general texts: given a piece of original text, the model rewrites it and outputs text with the characteristics of that style, which can be used for text revision, polishing, or style imitation.

github

: 121

awesome-VLLMs

github

: 52

Awesome-Model-Merging-Methods-Theories-Applications

A comprehensive repository focusing on 'Model Merging in LLMs, MLLMs, and Beyond', providing an overview of model merging methods, theories, applications, and future research directions. The repository covers various advanced methods, applications in foundation models and in different machine learning subfields, and topics such as pre-merging methods, architecture transformation, weight alignment, basic merging methods, and more.

github

: 347

LLM4EC

LLM4EC is an interdisciplinary research repository focusing on the intersection of Large Language Models (LLM) and Evolutionary Computation (EC). It provides a comprehensive collection of papers and resources exploring various applications, enhancements, and synergies between LLM and EC. The repository covers topics such as LLM-assisted optimization, EA-based LLM architecture search, and applications in code generation, software engineering, neural architecture search, and other generative tasks. The goal is to facilitate research and development in leveraging LLM and EC for innovative solutions in diverse domains.

github

: 78

go-cyber

Cyber is a superintelligence protocol that aims to create a decentralized and censorship-resistant internet. It uses a novel consensus mechanism called CometBFT and a knowledge graph to store and process information. Cyber is designed to be scalable, secure, and efficient, and it has the potential to revolutionize the way we interact with the internet.

github

: 353

cool-ai-stuff

This repository contains an uncensored list of free to use APIs and sites for several AI models.

> _This list is mainly managed by @zukixa, the queen of zukijourney, so any decisions may have bias!~_
>
> **Scroll down for the sites, APIs come first!**

* * *

> [!WARNING]
> We are not endorsing _any_ of the listed services! Some of them might be considered controversial. We are not responsible for any legal, technical or any other damage caused by using the listed services. Data is provided without warranty of any kind. **Use these at your own risk!**

* * *

# APIs Table of Contents

#### Overview of Existing APIs
#### Overview of Existing APIs -- Top LLM Models Available
#### Overview of Existing APIs -- Top Image Models Available
#### Overview of Existing APIs -- Top Other Features & Models Available
#### Overview of Existing APIs -- Available Donator Perks

* * *

## API List:*

*: This list solely covers all providers I (@zukixa) was able to collect metrics in. Any mistakes are not my responsibility, as I am either banned, or not aware of x API.

¹: Last Updated 4/14/24

### Overview of APIs:

| Service | # of Users¹ | Link | Stability | NSFW Ok? | Open Source? | Owner(s) | Other Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| zukijourney | 4441 | D | High | On /unf/, not /v1/ | ✅, Here | @zukixa | Largest & Oldest GPT-4 API still continuously around. Offers other popular AI-related Bots too. |
| Hyzenberg | 1234 | D | High | Forbidden | ❌ | @thatlukinhasguy & @voidiii | Experimental sister API to Zukijourney. Successor to HentAI |
| NagaAI | 2883 | D | High | Forbidden | ❌ | @zentixua | Honorary successor to ChimeraGPT, the largest API in history (15k users). |
| WebRaftAI | 993 | D | High | Forbidden | ❌ | @ds_gamer | Largest API by model count. Provides a lot of service/hosting related stuff too. |
| KrakenAI | 388 | D | High | Discouraged | ❌ | @paninico | It is an API of all time. |
| ShuttleAI | 3585 | D | Medium | Generally Permitted | ❌ | @xtristan | Faked GPT-4 Before 1, 2 |
| Mandrill | 931 | D | Medium | Enterprise-Tier-Only | ❌ | @fredipy | DALL-E-3 access pioneering API. Has some issues with speed & stability nowadays. |
| oxygen | 742 | D | Medium | Donator-Only | ❌ | @thesketchubuser | Bri'ish 🤮 & Fren'sh 🤮 |
| Skailar | 399 | D | Medium | Forbidden | ❌ | @aquadraws | Service is the personification of the word 'feature creep'. Lots of things announced, not much operational. |

github

: 907

SpinQuant

SpinQuant is a tool designed for LLM quantization with learned rotations. It focuses on optimizing rotation matrices to enhance the performance of quantized models, narrowing the accuracy gap to full precision models. The tool implements rotation optimization and PTQ evaluation with optimized rotation, providing arguments for model name, batch sizes, quantization bits, and rotation options. SpinQuant is based on the findings that rotation helps in removing outliers and improving quantization, with specific enhancements achieved through learning rotation with Cayley optimization.

github

: 76
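The SpinQuant entry above says that rotation helps remove outliers and improves quantization. The sketch below illustrates only that general idea; it is not SpinQuant's code. It assumes a fixed normalized Hadamard matrix as the rotation (SpinQuant learns its rotations with Cayley optimization, which is not reproduced here), a toy 4-bit round-to-nearest quantizer applied to activations only, and made-up vectors `x` and `w`. All function and variable names are hypothetical.

```python
import math

def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, v):
    """Matrix-vector product m @ v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def fake_quantize(v, bits=4):
    """Symmetric per-tensor round-to-nearest quantization, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(t) for t in v) / qmax
    return [round(t / scale) * scale for t in v]

# Toy activation with one large outlier, and a toy weight vector (assumed data).
x = [8.0, 0.1, -0.2, 0.3]
w = [0.5, -1.0, 0.25, 2.0]

# Assumption: a fixed orthogonal rotation stands in for a learned one.
R = hadamard(4)
x_rot = matvec(R, x)   # rotated activations: the outlier is spread out
w_rot = matvec(R, w)   # rotating weights the same way keeps x . w unchanged

exact = dot(x, w)
err_plain = abs(dot(fake_quantize(x), w) - exact)
err_rot = abs(dot(fake_quantize(x_rot), w_rot) - exact)

print("largest |activation| before/after rotation:",
      max(abs(t) for t in x), max(abs(t) for t in x_rot))
print("output error without rotation:", err_plain)
print("output error with rotation:   ", err_rot)
```

Because the rotation is orthogonal, the full-precision output is unchanged, while the quantizer's scale is no longer dominated by a single outlier. On this toy input the rotated path has the smaller error; that is an illustration, not a general guarantee.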

Chinese-LLaMA-Alpaca

This project open sources the **Chinese LLaMA model and the Alpaca large model fine-tuned with instructions**, to further promote the open research of large models in the Chinese NLP community. These models **extend the Chinese vocabulary based on the original LLaMA** and use Chinese data for secondary pre-training, further enhancing the basic Chinese semantic understanding ability. At the same time, the Chinese Alpaca model further uses Chinese instruction data for fine-tuning, significantly improving the model's understanding and execution of instructions.

github

: 17.2k

2025-AI-College-Jobs

2025-AI-College-Jobs is a repository containing a comprehensive list of AI/ML & Data Science jobs suitable for college students seeking internships or new graduate positions. The repository is regularly updated with positions posted within the last 120 days, featuring opportunities from various companies in the USA and internationally. The list includes positions in areas such as research scientist internships, quantitative research analyst roles, and other data science-related positions. The repository aims to provide a valuable resource for students looking to kickstart their careers in the field of artificial intelligence and machine learning.

github

: 988

gpupixel

GPUPixel is a real-time, high-performance image and video filter library written in C++11 and based on OpenGL/ES. It includes a built-in beauty face filter that achieves commercial-grade beauty effects. The library is small, easy to compile and integrate, and supports iOS, Android, Mac, Windows, and Linux. GPUPixel provides filters such as skin smoothing, whitening, face slimming, big eyes, lipstick, and blush. It supports input formats such as YUV420P, RGBA, JPEG, and PNG, and output formats such as RGBA and YUV420P. The library is optimized for iPhone and Android devices, with low CPU usage and fast processing times, and its compact size makes it suitable for mobile and desktop applications.

github

: 1.7k

For similar tasks

Awesome-LLMs-for-Video-Understanding

github

: 1.8k

Video-MME

Video-MME is the first-ever comprehensive evaluation benchmark of Multi-modal Large Language Models (MLLMs) in Video Analysis. It assesses the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities. The dataset comprises 900 videos totaling 256 hours, with 2,700 human-annotated question-answer pairs. It distinguishes itself through variety in video duration, diversity in video types, breadth in data modalities, and quality of annotations.

github

: 225

ControlLLM

ControlLLM is a framework that empowers large language models to leverage multi-modal tools for solving complex real-world tasks. It addresses challenges like ambiguous user prompts, inaccurate tool selection, and inefficient tool scheduling by utilizing a task decomposer, a Thoughts-on-Graph paradigm, and an execution engine with a rich toolbox. The framework excels in tasks involving image, audio, and video processing, showcasing superior accuracy, efficiency, and versatility compared to existing methods.

github

: 174

gen-cv

This repository is a rich resource offering examples of synthetic image generation, manipulation, and reasoning using Azure Machine Learning, Computer Vision, OpenAI, and open-source frameworks like Stable Diffusion. It provides practical insights into image processing applications, including content generation, video analysis, avatar creation, and image manipulation with various tools and APIs.

github

: 417

outspeed

Outspeed is a PyTorch-inspired SDK for building real-time AI applications on voice and video input. It offers low-latency processing of streaming audio and video, an intuitive API familiar to PyTorch users, flexible integration of custom AI models, and tools for data preprocessing and model deployment. It is well suited to developing voice assistants, video analytics, and other real-time AI applications that process audio-visual data.

github

: 325

starter-applets

This repository contains the source code for Google AI Studio's starter apps — a collection of small apps that demonstrate how Gemini can be used to create interactive experiences. These apps are built to run inside AI Studio, but the versions included here can run standalone using the Gemini API. The apps cover spatial understanding, video analysis, and map exploration, showcasing Gemini's capabilities in these areas. Developers can use these starter applets to kickstart their projects and learn how to leverage Gemini for spatial reasoning and interactive experiences.

github

: 467

TRACE

TRACE is a video temporal grounding model that uses causal event modeling to capture the inherent structure of videos. It presents a task-interleaved video LLM tailored for sequential encoding and decoding of timestamps, salient scores, and textual captions. The project includes model checkpoints for different training stages and for fine-tuning on specific datasets. It provides evaluation code for tasks such as VTG, MVBench, and VideoMME. The repository also offers annotation files and links to projects for preparing the raw videos. Users can train the model on different tasks and evaluate performance with metrics such as CIDEr, METEOR, SODA_c, F1, mAP, and Hit@1. TRACE has been extended with the trace-retrieval and trace-uni models, which show improved performance on dense video captioning and general video understanding tasks.

github

: 54

Dispider

Dispider is an implementation enabling real-time interactions with streaming videos, providing continuous feedback in live scenarios. It separates perception, decision-making, and reaction into asynchronous modules, ensuring timely interactions. Dispider outperforms VideoLLM-online on benchmarks like StreamingBench and excels in temporal reasoning. The tool requires CUDA 11.8 and specific library versions for optimal performance.

github

: 92

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

github

: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

github

: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

github

: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

github

: 620

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

github

: 27.7k

BricksLLM

BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Here are some use cases for BricksLLM:

* Set LLM usage limits for users on different pricing tiers
* Track LLM usage on a per user and per organization basis
* Block or redact requests containing PII
* Improve LLM reliability with failovers, retries and caching
* Distribute API keys with rate limits and cost limits for internal development/production use cases
* Distribute API keys with rate limits and cost limits for students

github

: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

github

: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

github

: 2.2k