SoM-LLaVA

SoM-LLaVA

Empowering Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.

Stars: 92

Visit
 screenshot

SoM-LLaVA is a new data source and learning paradigm for Multimodal LLMs, empowering open-source Multimodal LLMs with Set-of-Mark prompting and improved visual reasoning ability. The repository provides a new dataset that is complementary to existing training sources, enhancing multimodal LLMs with Set-of-Mark prompting and improved general capacity. By adding 30k SoM data to the visual instruction tuning stage of LLaVA, the tool achieves 1% to 6% relative improvements on all benchmarks. Users can train SoM-LLaVA via command line and utilize the implementation to annotate COCO images with SoM. Additionally, the tool can be loaded in Huggingface for further usage.

README:

📝 [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Empowering Open-Source Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [Paper] [HF Model]

📣 Note: Our new dataset is complementary to existing training sources, add it to your train set and boost your multimodal LLMs with Set-of-Mark prompting and improved general capacity! No cost at inference time!

🔥 News

  • [07/10] Our paper is accepted at COLM-2024, see you in Philly!
  • [04/26] Thanks AK and HF daily papers for featuring our work!
  • [04/25] Our paper is on arxiv! [Paper]
  • [04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]

📜 Contents

📊 Results

Method LLM POPE MME SEED-I LLaVA-Wild MM-VET
BLIP-2 Vicuna-13B 85.3 1293.8 49.7 38.1 22.4
LLaVA-1.5 Vicuna-13B 85.9 1531.3 68.2 70.7 35.4
SoM-LLaVA-1.5 Vicuna-13B 86.6 1563.1 69.6 75.3 35.9
SoM-LLaVA-1.5 w/ tags Vicuna-13B 87.0 1572.8 69.5 73.3 37.2

📣 Note: We get 1% to 6% relative improvements on all benchmarks, by simply adding 30k SoM data to the visual instruction tuning stage of LLaVA. SoM-LLaVA-1.5 w/ tags is to feed the model with tagged images, but you can enjoy the performance gain even without the extra visual prompts at test time!

🌱 SoM Dataset

[Training data for SoM-LLaVA]

som_llava_mix695k.json: Full SFT data with llava-665k + SoM-30k

som_listing_coco10k.json: listing all items with SoM images.

som_qa_coco20k.json: QA with SoM images. (Note: QA used the same 10k images from listing, with another batch of 10k added.)

som_train2017.zip: A subset of 20k coco images that is annotated with SoM, used in our data construction.

🍰 Model Checkpoints

We release our main model, SoM-LLaVA trained with LLaVA-665k and SoM-style Listing + QA data.

[SoM-LLaVA-v1.5-13B] (model weights in original LLaVA format, load and eval with LLaVA)

[SoM-LLaVA-v1.5-13B-HF] (model weights converted into HF format, see usage below)

Two additional models for ablation study:

[SoM-LLaVA-v1.5-13B-listing]

[SoM-LLaVA-v1.5-13B-qa]

🍡 Showcases

🍄 Training

We adopt the training code of LLaVA. Please set up environments following the instructions. Currently our data is used in the Visual Instruction Tuning stage.

  1. Prepare data

Please download the annotation of the final mixture of our instruction tuning data som_llava_mix695k.json , which is a mixture of llava_mix665k and 30k SoM data, and download the images from the following datasets:

After downloading all of them, organize the data as follows in your data folder.

├── coco
│   ├── train2017
│   └── som_train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
  1. Training

After downloading our data (or preparing your own SoM data), train SoM-LLaVA via command line:

bash scripts/v1_5/finetune.sh

❄️ Using SoM

Note: Our implementation is improved over the original SoM repo, by removing overlapping regions for each mask (otherwise there will be confilicts/overlaps for tag positions).

  • Init virtual envs
# create env. Note: must use 3.10, 3.11 will cause package conflicts.
conda create -n som python=3.10 -y
conda activate som
  • Install libgeos if there is error installing SEEM
sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev
  • Install segmentation packages
# download repo and navigate to SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd ~/SoM-LLaVA/SoM/

# install PyTorch
pip3 install torch torchvision torchaudio

# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..

# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'

# install additional packages
pip install datasets
  • Download the pretrained models
sh download_ckpt.sh
  • Annotate COCO images with SoM
python annotate_coco.py

😊 Using LLaVA in HF

If you would like to load our model in huggingface, here is an example script:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print (output)

Note: to reproduce the results reported in the paper, we recommend using the official LLaVA repo with our LLaVA-format model.

🐱 Citation

If you find our data or model useful for your research and applications, please cite our paper:

@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}

🍻 Acknowledgments

This project is a collaborative work between UC San Diego and Microsoft GenAI, built on top of LLaVA and SoM. Thank the authors for their contributions to the community!

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for SoM-LLaVA

Similar Open Source Tools

For similar tasks

For similar jobs