Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models


Grounded-VideoLLM is a Video Large Language Model specialized in fine-grained temporal grounding. It excels in tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA. The model incorporates an additional temporal stream, discrete temporal tokens with specific time knowledge, and a multi-stage training scheme. It shows potential as a versatile video assistant for general video understanding. The repository provides pretrained weights, inference scripts, and datasets for training. Users can run inference queries to get temporal information from videos and train the model from scratch.

README:

🌟 This is the official repository for the video large language model Grounded-VideoLLM, a Video-LLM adept at fine-grained temporal grounding. Grounded-VideoLLM not only excels in grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.



💡 We sharpen our model by incorporating:

  • An additional temporal stream to encode the relationships between frames.
  • Discrete temporal tokens enriched with specific time knowledge to represent timestamps (see the sketch after this list).
  • A multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance the temporal reasoning capability, we also curate a grounded VideoQA dataset by an automatic annotation pipeline.
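For intuition, here is a minimal Python sketch of how timestamps might be quantized into discrete temporal tokens and decoded back. The bin count (NUM_TEMPORAL_TOKENS) and the <T_i> token naming are illustrative assumptions, not the repository's actual implementation; please refer to the paper for the exact formulation.

```python
# Illustrative sketch of discrete temporal tokens (assumptions: 300 bins, <T_i> naming).
NUM_TEMPORAL_TOKENS = 300  # assumed number of temporal bins


def timestamp_to_token(t_sec: float, duration_sec: float) -> str:
    """Quantize an absolute timestamp into a relative temporal token <T_i>."""
    ratio = min(max(t_sec / duration_sec, 0.0), 1.0)  # clamp to [0, 1]
    idx = round(ratio * (NUM_TEMPORAL_TOKENS - 1))
    return f"<T_{idx}>"


def token_to_timestamp(token: str, duration_sec: float) -> float:
    """Recover an approximate timestamp (in seconds) from a temporal token."""
    idx = int(token.strip("<>").split("_")[1])
    return idx / (NUM_TEMPORAL_TOKENS - 1) * duration_sec


if __name__ == "__main__":
    duration = 120.0  # a 2-minute video
    tok = timestamp_to_token(14.2, duration)
    print(tok, "->", round(token_to_timestamp(tok, duration), 2), "seconds")
```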

📰 News

  • [x] [2024.10.4] Release the inference scripts and pretrained checkpoints.
  • [x] [2024.10.4] Release the annotated Grounded-VideoQA dataset.
  • [x] [2024.10.4] Release the Phi3.5-Vision-Instruct version.
  • [x] [2024.10.29] Release the LLaVA-Next-LLAMA3-8B version, with stronger performance in both grounding tasks and general benchmarks.
  • [x] Release the training scripts and training datasets. We will try to adapt more MLLMs as the base model for Grounded-VideoLLM in the future.

Performance

| Model Name | LLM | Charades-STA (R@0.3/R@0.5/R@0.7/mIoU) | ActivityNet-Grounding (R@0.3/R@0.5/R@0.7/mIoU) | ActivityNet-Captions (SODA_c/METEOR) | NExT-GQA (GQA/mIoP/mIoU) | MVBench | Video-MME (w/o subs) |
|---|---|---|---|---|---|---|---|
| Grounded-VideoLLM | Phi3.5-3.8B | 54.2/36.4/19.7/36.8 | 46.2/30.3/19.0/36.1 | 6.0/6.8 | 26.7/34.5/21.1 | 59.4 | 47.7 |
| Grounded-VideoLLM (*) | Phi3.5-3.8B | 70.2/55.9/33.2/49.4 | 64.9/47.8/30.4/47.2 | 6.6/6.5 | 29.4/37.4/27.0 | 60.0 | 48.1 |
  • (*) means we incorporate a training subset of Charades-STA and ActivityNet into the third training stage. Please refer to our paper for more results.

🛠️ Install

  1. Clone this repository and navigate to the folder:
git clone https://github.com/WHB139426/Grounded-Video-LLM.git
cd Grounded-Video-LLM
  2. Install the packages:
conda create -n grounded-videollm python=3.10.14
conda activate grounded-videollm
pip install torch==2.1.2 torchaudio==2.1.2 torchvision==0.16.2 torchdata==0.8.0 # install torch before flash-attn
pip install -r requirements.txt
pip install numpy==1.26.4 # ensure numpy<2.0

Some installation suggestions

  • We recommend installing flash-attn==2.3.3 and running the model with torch.bfloat16. If your device does not support these, you can skip them and adjust the argparse parameters attn_implementation and dtype in inference.py, which may result in subtle numerical differences (see the sketch below).
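If your device lacks flash-attn or bfloat16 support, the fallback choice could look roughly like the following sketch. pick_backend is a hypothetical helper for illustration; the exact argparse wiring of attn_implementation and dtype in inference.py may differ.

```python
# Hedged sketch of a flash-attn / bfloat16 fallback; the real flags in
# inference.py may be named or wired differently.
import importlib.util

import torch


def pick_backend():
    """Choose an attention implementation and dtype based on what the device supports."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

    attn_implementation = "flash_attention_2" if has_flash_attn else "eager"
    dtype = torch.bfloat16 if bf16_ok else torch.float16
    return attn_implementation, dtype


if __name__ == "__main__":
    attn_implementation, dtype = pick_backend()
    print(f"attn_implementation={attn_implementation}, dtype={dtype}")
```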

🤗 Prepare the pretrained weights

Set your own weight_path to store the pretrained weights. The folder should be organized as follows:

├── Grounded-Video-LLM
│   └── inference.py
│   └── models
│   └── mm_utils
│   └── training
│   └── scripts
│   └── ...
├── weight_path
│   └── Phi-3.5-mini-instruct
│   └── Phi-3.5-vision-instruct-seperated
│   └── Phi-3.5-vision-instruct
│   └── llama3-llava-next-8b
│   └── llama3-llava-next-8b-seperated
│   └── Meta-Llama-3-8B-Instruct
│   └── ckpt
│   └── internvideo
│   └──...

Download the pretrained weights [🤗HF] into your own weight_path.

🚀 Quick Start

We provide a brief example of running the inference code. We recommend GPUs with 24GB of memory for the Phi3.5 version and 32GB of memory for the LLaVA-Next-LLAMA3-8B version.

  1. Replace the parameter weight_path in scripts/inference_phi3_5.sh or scripts/inference_llama3.sh with the weight_path you set above.
  2. Run bash scripts/inference_phi3_5.sh or bash scripts/inference_llama3.sh to reproduce the example below:

https://private-user-images.githubusercontent.com/115783170/373608088-69f83fd7-59d2-4105-a766-cd712d14d425.mp4

[USER] Give you a textual query: "The female host wearing purple clothes is reporting news in the studio". When does the described content occur in the video? Please return the start and end timestamps.
[Grounded-VideoLLM] From 14.20 seconds to 25.09 seconds.

[USER] Give you a textual query: "A sign written with 'NO TRESPASSING LOITERING DRUGS'". When does the described content occur in the video? Please return the start and end timestamps.
[Grounded-VideoLLM] From 107.95 seconds to 113.16 seconds.

[USER] What is happening from 70 seconds to 80 seconds?
[Grounded-VideoLLM] A woman with glasses and a red shirt is talking to a reporter.

[USER] Why was the man in green clothes interviewed?
[Grounded-VideoLLM] The man in green clothes was interviewed to provide his perspective on the incident and the history of violence in the apartment complex.

[USER] Question: What does this TV news report about?\nOptions:\n(A) thievery\n(B) community violence incidents\n(C) fashion show\n(D) aging population
[Grounded-VideoLLM] Answer: (B) community violence incidents
  3. You can change the parameters prompt_grounding, prompt_videoqa, prompt_referring, and video_path in inference.py's argparse to run your own cases.
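The grounding answers above follow a "From X seconds to Y seconds" pattern, so a small post-processing helper like the following sketch (not part of the repository) can convert them into numeric spans:

```python
# Illustrative helper for post-processing grounding answers such as
# "From 14.20 seconds to 25.09 seconds." into numeric (start, end) spans.
import re

SPAN_RE = re.compile(r"From\s+([\d.]+)\s+seconds?\s+to\s+([\d.]+)\s+seconds?", re.IGNORECASE)


def parse_span(answer: str) -> tuple[float, float] | None:
    """Return (start_sec, end_sec) if the answer contains a temporal span."""
    match = SPAN_RE.search(answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_span("From 14.20 seconds to 25.09 seconds."))  # (14.2, 25.09)
```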

🎬 Grounded-VideoQA dataset

We provide the Grounded-VideoQA dataset that we annotated with GPT-4o-mini in [🤗HF]. You can download the videos following [ActivityNet] and [QVHighlights].

💡 Training Grounded-VideoLLM from scratch

  1. Prepare your training data: set your own data_path in the following .sh files to store the data before running the commands (you can download only the specific data you want by modifying these scripts):
bash scripts/download_data_stage1.sh
bash scripts/download_data_stage2.sh
  2. Set up the data_dir and weight_path in scripts/phi3.5_xxx_8_a100.sh, where xxx can be [pretrain, grounded, sft], and then run the following commands:
bash scripts/phi3.5_pretrain_8_a100.sh
bash scripts/phi3.5_grounded_8_a100.sh
bash scripts/phi3.5_sft_8_a100.sh

The checkpoints will be saved in ./experiments.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@article{wang2024grounded,
  title={Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models},
  author={Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2410.03290},
  year={2024}
}

🤝 Acknowledgement

We are grateful to the following awesome projects that Grounded-VideoLLM builds upon: Prismatic-VLMs, Phi-3.5-vision-instruct, InternVideo2, LLaVA-Next, TimeChat, VTimeLLM, Momentor.
