MMStar

MMStar is an elite vision-indispensable multi-modal benchmark comprising 1,500 challenging samples meticulously selected by humans. It addresses two key issues in current LVLM evaluation: visual content is unnecessary for many samples, and unintentional data leakage exists in LLM and LVLM training. MMStar evaluates 6 core capabilities across 18 detailed axes, ensuring a balanced distribution of samples across all dimensions.

README:

MMStar

🌐 Homepage | 🤗 Dataset | 📖 Paper | 🏆 Leaderboard

This repo contains the official evaluation code and dataset for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"

💡 Highlights

  • 🔥 Two Key Issues that Lead to Misjudgment of Current LVLMs' Capabilities
  • 🔥 An Elite Vision-Indispensable Multi-modal Benchmark, MMStar
  • 🔥 Two Metrics: Multi-modal Gain (MG) and Multi-modal Leakage (ML)

📜 News

[2024.4.16] 🚀 MMStar is now supported in the VLMEvalKit repository and on the OpenCompass leaderboard.

[2024.4.2] 🚀 The Hugging Face dataset and evaluation code are available!

[2024.4.1] 🚀 We released the arXiv paper.

πŸ‘¨β€πŸ’» Todo

  • [x] Evaluation code for MMStar
  • [x] Support online Leaderboard
  • [ ] Curate an online test set, MMStar-test (this involves working with existing multi-modal benchmarks that maintain protected test sets; feel free to contact us!)

👀 Introduction

We dig into current evaluation work and identify two primary issues:

(1) Visual content is unnecessary for many samples.

(2) Unintentional data leakage exists in LLM and LVLM training.

Both problems lead to misjudgments of actual multi-modal performance gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 challenging samples meticulously selected by humans. After applying a coarse filtering process and manual review, we narrow down from a total of 22,401 samples to 11,607 candidate samples and finally select 1,500 high-quality samples to construct the MMStar benchmark.

MMStar covers 6 core capabilities (shown in the inner ring of the capability diagram), broken down into 18 detailed axes (the outer ring); the middle ring shows the number of samples for each detailed dimension. Each core capability contains a meticulously balanced set of 250 samples, and we further ensure a relatively even distribution across the 18 detailed axes.
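As a quick sanity check of this distribution, the labels can be tallied directly from the Hugging Face release. The sketch below assumes the dataset is published as Lin-Chen/MMStar with a val split and exposes the core capability and detailed axis under fields named category and l2_category; please verify these names against the dataset card.

```python
# Sketch: tally MMStar samples per core capability and per detailed axis.
# Assumption: the Hugging Face dataset "Lin-Chen/MMStar" has a "val" split
# and label fields "category" / "l2_category" (check the dataset card).
from collections import Counter
from datasets import load_dataset

ds = load_dataset("Lin-Chen/MMStar", split="val")

core_counts = Counter(ds["category"])     # expected: 6 capabilities, 250 samples each
axis_counts = Counter(ds["l2_category"])  # expected: 18 axes, roughly even

print(core_counts)
print(axis_counts)
```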

🤖 Evaluation

You can evaluate any LLMs and LVLMs on MMStar by following the evaluation guidelines.
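For orientation, here is a minimal, unofficial sketch of an evaluation loop. It assumes the Hugging Face dataset fields question, image, and answer (with the answer being a single option letter), and uses a placeholder model_answer function standing in for your LVLM's inference; the official evaluation guidelines define the exact prompting and scoring protocol.

```python
# Minimal, unofficial evaluation sketch for MMStar.
# Assumptions: fields "question", "image", "answer" (a single option letter);
# model_answer() is a placeholder for your LVLM's inference call.
from datasets import load_dataset

def model_answer(image, question: str) -> str:
    """Placeholder: return the option letter your model picks, e.g. 'A'."""
    raise NotImplementedError

ds = load_dataset("Lin-Chen/MMStar", split="val")

correct = 0
for sample in ds:
    pred = model_answer(sample["image"], sample["question"]).strip().upper()
    gold = str(sample["answer"]).strip().upper()
    if pred[:1] == gold[:1]:
        correct += 1

print(f"Accuracy: {correct / len(ds):.4f}")
```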

πŸ† Leaderboard

🎯 The leaderboard for MMStar is continuously updated, and we welcome contributions of your LVLMs!

Please note that to thoroughly evaluate your own LVLM, you need to provide us with three result files in xlsx format: the results of your LVLM with visual input, the results of your LVLM without visual input, and the results of your original LLM base without visual input. A submission template is provided in the submits folder. After completing these steps, please contact us at [email protected] to submit your results and update the leaderboard.
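For reference, the sketch below shows how the two reported metrics can be derived from the overall accuracies in those three files, following the paper's definitions of multi-modal gain (MG) and multi-modal leakage (ML) as we read them: MG = Sv - Swv and ML = max(0, Swv - St). The file names and the accuracy column are placeholders; adapt them to the template in the submits folder.

```python
# Sketch: derive Multi-modal Gain (MG) and Multi-modal Leakage (ML)
# from the three overall accuracy scores (in %):
#   MG = S_v - S_wv           # LVLM with vs. without visual input
#   ML = max(0, S_wv - S_t)   # LVLM without visual vs. its LLM base
# File names and the "accuracy" column are placeholders; adapt them to the
# submission template in the submits folder.
import pandas as pd

def overall_accuracy(path: str) -> float:
    """Placeholder: read an overall accuracy score from a result xlsx."""
    df = pd.read_excel(path)
    return float(df["accuracy"].mean())

s_v = overall_accuracy("lvlm_with_visual.xlsx")         # LVLM, with images
s_wv = overall_accuracy("lvlm_without_visual.xlsx")     # LVLM, text only
s_t = overall_accuracy("llm_base_without_visual.xlsx")  # LLM base, text only

mg = s_v - s_wv
ml = max(0.0, s_wv - s_t)
print(f"Multi-modal Gain (MG): {mg:.2f}")
print(f"Multi-modal Leakage (ML): {ml:.2f}")
```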

📧 Contact

βœ’οΈ Citation

If you find our work helpful for your research, please consider giving it a star ⭐ and a citation 📝.

@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}
