Grounded_3D-LLM

Code & Data for Grounded 3D-LLM with Referent Tokens

Grounded 3D-LLM is a unified generative framework that utilizes referent tokens to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It transforms 3D vision tasks into language formats through task-specific prompts, curating grounded language datasets and employing Contrastive Language-Scene Pre-training (CLASP) to bridge the gap between 3D vision and language models. The model covers tasks like 3D visual question answering, dense captioning, object detection, and language grounding.

README:

Grounded 3D-LLM with Referent Tokens

This repository will release the official implementation of "Grounded 3D-LLM with Referent Tokens".

[Paper] [Arxiv] [Website] [Data]

Abstract

Prior studies on 3D scene comprehension have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (LMMs) to consolidate various 3D visual tasks within a unified generative framework. The model utilizes a series of "referent tokens" to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data arbitrarily. 3D vision tasks are naturally transformed into language formats through task-specific prompts. To effectively associate the scene with text, we curate the grounded language datasets either from human-annotated sources or by bootstrapping existing object labels at the phrase level. We then employ Contrastive Language-Scene Pre-training (CLASP) to bridge the divide between 3D vision and language models, thus facilitating the use of referent tokens in subsequent language modeling tasks. Our comprehensive evaluation covers open-ended tasks like 3D visual question answering and dense captioning, as well as close-ended tasks such as object detection and language grounding.

Grounded Scene Caption Data Visualization

Please refer to the data visualization page for detailed instructions on the minimal setup for visualizing the grounded scene caption dataset.

Model Training

Step 1: Environment setup and dataset preparation.

Grounded 3D-LLM is trained using 4 or 8 NVIDIA Tesla A100 GPUs. Please refer to the installation page for detailed installation scripts for model training.

Please download all the scene-language datasets from HuggingFace. The datasets are listed as follows:

| Dataset | # for Train | # for Eval |
|----------------------|---------|------|
| ScanRefer | 36639 | 9503 |
| Scan2Cap | 36639 | 9503 |
| ScanQA | 26516 | 9402 |
| Object-Description | 28197 | 7912 |
| GroundedSceneCaption | 84301 | -- |
| EmbodiedPlanning | 3500 | -- |
| EmbodiedDialogue | 129799 | -- |
| GlobalSceneCaption | 4065 | -- |
| 3D-LLM | 27627 | -- |
| Alpaca | 51865 | -- |

Please download the pretrained weights from HuggingFace and place them in $ROOT_PATH/pretrained/.

Please download the pretrained LLM weights (Tiny-Vicuna-1B) and store them in $ROOT_PATH/pretrained/llm_weight/Tiny-Vicuna-1B/.

If you would like to use our pretrained model checkpoints, they can be obtained from HuggingFace. Please save them in the checkpoint directory at $ROOT_PATH/saved.

| Step | Model Checkpoint |
|------|------------------|
| 1 | Mask3D-CLIP |
| 2 | Mask3D-CLASP |
| 3 | Grounded 3D-LLM |
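
For convenience, the snippet below sketches one way to fetch these assets with the `huggingface_hub` Python API. The dataset and checkpoint repository IDs are placeholders (use the HuggingFace repositories linked above); only the Tiny-Vicuna-1B repository ID is taken from this README.

```python
# Download sketch using the huggingface_hub API (pip install huggingface_hub).
# NOTE: the dataset/checkpoint repo IDs below are placeholders -- replace them
# with the actual HuggingFace repositories linked in this README.
from huggingface_hub import snapshot_download

# Scene-language datasets (JSON files) -> data/langdata/
snapshot_download(
    repo_id="<org>/<grounded-3d-llm-langdata>",     # placeholder
    repo_type="dataset",
    local_dir="data/langdata",
)

# Pretrained LLM weights (repository ID given in this README) -> pretrained/llm_weight/
snapshot_download(
    repo_id="Jiayi-Pan/Tiny-Vicuna-1B",
    local_dir="pretrained/llm_weight/Tiny-Vicuna-1B",
)

# Optional: released model checkpoints (Mask3D-CLIP, Mask3D-CLASP, Grounded 3D-LLM) -> saved/
snapshot_download(
    repo_id="<org>/<grounded-3d-llm-checkpoints>",  # placeholder
    local_dir="saved",
)
```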

After completing the downloads, the root folder should be organized as follows:

ROOT_PATH
├── data                            # data
│   ├── langdata
│   │   ├── groundedscenecaption_format.json
│   │   ├── scanrefer_format.json
│   │   ├── scanqa_format.json
│   │   └── ...
│   ├── processed
│   └── rawscannet
│       ├── scans
│       └── scannetv2-labels.combined.tsv
├── pretrained                      # pretrained weights for model training
│   ├── bert-base-uncased           # bert pretrained weights
│   ├── label_clip_features.pth     # clip's text features for scannet-200 class names
│   └── llm_weight
│       └── Tiny-Vicuna-1B          # pretrained weights from https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B
└── saved                           # model checkpoints saved path
    ├── step1_mask3d_clip_4GPUS
    ├── step2_mask3d_lang_4GPUS
    └── step3_mask3d_lang_4GPUS
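
A quick check such as the following (not part of the repository) can confirm that the required files are in place before training; the paths simply mirror the tree above.

```python
# Optional sanity check: verifies that the expected files/folders from the
# directory tree above exist under $ROOT_PATH before starting training.
from pathlib import Path

ROOT = Path(".")  # set to your $ROOT_PATH

expected = [
    "data/langdata/groundedscenecaption_format.json",
    "data/langdata/scanrefer_format.json",
    "data/langdata/scanqa_format.json",
    "data/rawscannet/scans",
    "data/rawscannet/scannetv2-labels.combined.tsv",
    "pretrained/bert-base-uncased",
    "pretrained/label_clip_features.pth",
    "pretrained/llm_weight/Tiny-Vicuna-1B",
]

missing = [p for p in expected if not (ROOT / p).exists()]
if missing:
    print("Missing paths:", *missing, sep="\n  ")
else:
    print("All expected paths found.")
```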

Step 2: Pre-train the Mask3D detector:

bash final_scripts/step1_pretrain_detector.sh

Step 3: After training the detector, run Contrastive Language-Scene Pre-training (CLASP):

bash final_scripts/step2_pretrain_3d-clasp.sh

Step 4: After contrastive pre-training, train and run inference with the full Grounded 3D-LLM:

bash final_scripts/step3_train_grounded3dllm.sh

The model checkpoint will be saved as saved/step3_mask3d_lang_4GPUS/last-checkpoint.pth, and the inference results will be stored in saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/.

Note that if you only want to run inference, simply comment out the training command in these scripts.

Model Evaluation

To evaluate all of the inference results, run the following command:

bash final_scripts/eval_llm.sh ./saved/step3_mask3d_lang_4GPUS/${TIMESTAMP}/

Demo

To interact with Grounded 3D-LLM via the demo chat, first run model inference and ensure that the scene features are saved in saved/step3_mask3d_lang_4GPUS/scene_features. Then launch the Gradio demo chat by running the following command:

bash web_chat_demo/web_chat_demo.sh 

Please note that the visualization of the related segmentation masks is not yet supported in the Gradio demo.
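
For orientation only, a minimal Gradio chat skeleton of the kind the demo script launches looks roughly like the following. This is not the repository's actual demo code; `answer_with_grounded_3dllm` is a hypothetical placeholder for the real model call, which would consume the cached scene features together with the chat history.

```python
# Minimal Gradio chat skeleton (illustrative only -- not the repository's demo).
# `answer_with_grounded_3dllm` is a hypothetical placeholder for the real model
# call, which would use the precomputed scene features plus the chat history.
import gradio as gr

def answer_with_grounded_3dllm(message: str, history) -> str:
    # Placeholder: the real demo returns grounded text generated by the model.
    return f"(placeholder response to: {message})"

gr.ChatInterface(fn=answer_with_grounded_3dllm, title="Grounded 3D-LLM Chat").launch()
```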

ToDo List

  • [x] Release Grounded Scene Caption data (ScanNet).
  • [x] Release data visualizer.
  • [x] Release data generation code.
  • [x] Release pre-trained checkpoints.
  • [x] Release Grounded 3D-LLM training and evaluation.
  • [ ] Demo supports mask visualization.

Acknowledgement

Many thanks to the following open-source projects:
