MathVerse

MathVerse

[ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Stars: 115

Visit
 screenshot

MathVerse is an all-around visual math benchmark designed to evaluate the capabilities of Multi-modal Large Language Models (MLLMs) in visual math problem-solving. It collects high-quality math problems with diagrams to assess how well MLLMs can understand visual diagrams for mathematical reasoning. The benchmark includes 2,612 problems transformed into six versions each, contributing to 15K test samples. It also introduces a Chain-of-Thought (CoT) Evaluation strategy for fine-grained assessment of output answers.

README:

MathVerse πŸ”₯: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

MathQA Mathematical Reasoning Multi-Modal

ChatGPT GPT-4 GPT-4V Gemini

Official repository for the paper "MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?".

🌟 For more details, please refer to the project page with dataset exploration and visualization tools: https://mathverse-cuhk.github.io/.

[🌐 Webpage] [πŸ“– Paper] [πŸ€— Huggingface Dataset] [πŸ† Leaderboard] [πŸ” Visualization]

πŸ’₯ News

πŸ“Œ ToDo

  • Coming soon: CoT Evaluation results & tools, and the full MathVerse dataset

πŸ‘€ About MathVerse

The capabilities of Multi-modal Large Language Models (MLLMs) in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams.


To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.


Six different versions of each problem in MathVerse transformed by expert annotators.

In addition, we propose a Chain-of-Thought (CoT) Evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs.


The two phases of the CoT evaluation strategy.

πŸ† Leaderboard

Contributing to the testmini Leaderboard

🚨 The Leaderboard for the testmini set is continuously being updated, welcoming the contribution of your excellent MLLMs! Currently, we regard the 'w/o' scores without the CoT evaluation as the primary metric in MathVerse, which is more cost-effective and saves time.

Data Usage

We release the testmini set of MathVerse for benchmarking on the leaderboard, which contains 788 visual math problems within two json files:

  • testmini.json: 788*5 test samples for five main versions to calculate the overall score, i.e., Text Dominant/Lite and Vision Intensive/Dominant/Only.
  • testmini_text_only.json: 788*1 test samples for Text Only to ablate the visual diagram understanding capacity.

You can download the dataset from the πŸ€— Huggingface by the following command (make sure that you have installed related packages):

from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVerse", "testmini")
dataset_text_only = load_dataset("AI4Math/MathVerse", "testmini_text_only")

Here are some examples of how to access the downloaded dataset:

# print the first example on the testmini set
print(dataset["testmini"][0])
print(dataset["testmini"][0]['sample_index'])  # print the test sample id
print(dataset["testmini"][0]['problem_index'])  # print the unique problem id
print(dataset["testmini"][0]['problem_version'])  # print the problem version
print(dataset["testmini"][0]['question'])  # print the question text
print(dataset["testmini"][0]['query'])  # print the question query
print(dataset["testmini"][0]['answer'])  # print the answer
print(dataset["testmini"][0]['query_wo'])  # the input query for w/o scores
print(dataset["testmini"][0]['query_cot'])  # the input query for CoT evaluation scores
dataset["testmini"][0]['image']  # display the image

# print the first text-only example within the testmini set
print(dataset_text_only["testmini_text_only"][0])

We also provide the images in the PNG format. You can download and unzip them using the following commands:

cd data
wget https://huggingface.co/datasets/AI4Math/MathVerse/resolve/main/images.zip
unzip images.zip && rm images.zip

This step might be optional if you prefer to use the Hugging Face format of the data.

Inference

First, please refer to the following two templates to prepare your result json files.

If you expect to evaluate the 'w/o' scores in the leaderboard, please adopt query_wo as the input for MLLMs, which prompts the model to output a direct answer. For CoT evaluation, we can utilize query_cot that motivates MLLMs to provide a step-by-step reasoning process. You are also encouraged to tune the optimal prompt for your own model.

Evaluation

Then, we provide the code to derive the 'w/o' scores on the leaderboard, which requires advanced LLMs (e.g., ChatGPT/GPT-4, or Qwen-Max) to extract and match answers. The code 'CoT-E' scores will be released soon.

There are two steps for the evaluation of 'w/o' scores, where we prompt the ChatGPT/GPT-4 API as an example:

Step 1: Answer Extraction

pip install openai
cd evaluation

python extract_answer_s1.py \
--model_output_file PATH_TO_OUTPUT_FILE \
--save_file PATH_TO_ENTRACTION_FILE \
--cache \
--trunk_response 30 \
--save_every 10 \
--api_key GPT_API

Note that, step 1 is optional if your MLLM can directly output a clean answer for scoring.

Step2: Answer Scoring

python score_answer_s2.py \
--answer_extraction_file PATH_TO_ENTRACTION_FILE \
--save_file PATH_TO_SCORE_FILE \
--cache \
--trunk_response 30 \
--save_every 10 \
--api_key GPT_API

Note that, we recommend using ChatGPT/GPT-4 API for step 2 by default. By adding --quick_match in the command above, we also support a direct string matching between extracted answers and ground truths, which is faster but not accurate enough.

πŸ“ Dataset Examples

πŸ–± Click to expand the examples for six problem versions within three subjects

πŸ” Plane Geometry


πŸ” Solid Geometry


πŸ” Functions


πŸ“ˆ Evaluation Results

Coming soon!

βœ… Citation

If you find MathVerse useful for your research and applications, please kindly cite using this BibTeX:

@article{zhang2024mathverse,
  title={MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?},
  author={Zhang, Renrui and Jiang, Dongzhi and Zhang, Yichi and Lin, Haokun and Guo, Ziyu and Qiu, Pengshuo and Zhou, Aojun and Lu, Pan and Chang, Kai-Wei and Gao, Peng and others},
  journal={arXiv preprint arXiv:2403.14624},
  year={2024}
}

🧠 Related Work

Explore our additional research on Vision-Language Large Models, focusing on multi-modal LLMs and mathematical reasoning:

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for MathVerse

Similar Open Source Tools

For similar tasks

For similar jobs