vision-llms-are-blind
This repository contains the code and data for the paper 'Vision Language Models Are Blind'. It explores the limitations of large language models with vision capabilities (VLMs) in performing basic visual tasks that are easy for humans. The repository presents benchmark results showcasing the poor performance of state-of-the-art VLMs on tasks like counting line intersections, identifying circles, letters, and shapes, and following color-coded paths. The research highlights the challenges faced by VLMs in understanding visual information accurately, drawing parallels to myopia and blindness in human vision.

README:

Vision Language Models Are Blind

by Pooyan Rahmanzadehgervi¹*, Logan Bolton¹*, Mohammad Reza Taesiri², Anh Totti Nguyen¹

*Equal contribution
¹Auburn University, ²University of Alberta

Website | arXiv | Hugging Face Dataset

This repository contains the code and data for the paper Vision Language Models Are Blind.

@article{vlms2024blind,
  title={Vision language models are blind},
  author={Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  journal={arXiv preprint arXiv:2407.06581},
  year={2024}
}

Abstract

While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they surprisingly still struggle with low-level vision tasks that are easy for humans. Specifically, on BlindTest, our suite of 7 very simple tasks, such as (a) determining whether two circles overlap, (b) determining whether two lines intersect, (c) identifying which letter is being circled in a word, and (d) counting the circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.12% accurate on average. Claude 3.5 Sonnet performs best, at 74.94% accuracy, but this is still far from the expected human accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and with recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io

How to Reproduce Results

  1. Find images in the src/{task} directory. For example: an image in the gpt-4o/incorrect folder.

  2. Locate the corresponding prompt in prompts.md. For example: "Are the two circles touching each other? Answer with Yes/No."

  3. Feed the image and prompt above to the models via the default API settings or the official playground, NOT their web interface (e.g., use https://platform.openai.com/playground/chat for GPT-4o). A minimal API sketch is shown after these steps.

  4. Compare your results with those in our paper, noting that variations may occur due to the default temperature = 1 setting.

Important: Using the models' web interfaces (e.g., chatgpt.com) may produce results very different from those reported in our paper.
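
For concreteness, here is a minimal sketch of step 3 using the OpenAI Python SDK (v1.x) at default settings. The image filename below is hypothetical; replace it with an actual file from src/{task}, and use the matching prompt from prompts.md.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical path: replace with a real image from src/{task}
image_b64 = encode_image("src/TouchingCircles/example.png")

# Prompt for the two-circles task, taken from prompts.md
prompt = "Are the two circles touching each other? Answer with Yes/No."

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1,  # default temperature, as used for the paper's results
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Other models can be queried the same way through their official playgrounds or APIs; only the client library and model name change.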

Tasks in the BlindTest benchmark

  1. Task 1: Counting Line Intersections
  2. Task 2: Two Circles (see the example sketch after this list)
  3. Task 3: Circled Letter
  4. Task 4: Counting Circles
  5. Task 5: Counting Nested Squares
  6. Task 6: Counting Rows and Columns
  7. Task 7: Following Color-Coded Paths
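
To give a sense of the stimuli, the sketch below draws a Task 2-style image of two circles whose edges are separated by a configurable gap. This is only an illustration under assumed parameters (radius, figure size, line width), not the generation code used for the paper, which lives in this repository.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def draw_two_circles(gap: float, radius: float = 0.8,
                     out_path: str = "two_circles.png") -> None:
    """Draw two circles whose edges are `gap` apart (gap = 0 means exactly touching)."""
    fig, ax = plt.subplots(figsize=(4, 4), dpi=100)
    # Place centers symmetrically about x = 0 so the edge-to-edge distance equals `gap`.
    centers_x = (-radius - gap / 2, radius + gap / 2)
    for cx in centers_x:
        ax.add_patch(Circle((cx, 0), radius, fill=False,
                            linewidth=2, edgecolor="black"))
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)

draw_two_circles(gap=0.2)                                        # clearly separated
draw_two_circles(gap=0.0, out_path="two_circles_touching.png")   # exactly touching
```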

Benchmark Results

Mean Accuracy - All Tasks

[Accuracy charts: mean accuracy of the four VLMs across all BlindTest tasks, followed by per-task results; see the paper or project website for the figures.]
