Abstract
Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering countless image-text applications and scoring high on many vision-understanding benchmarks. Yet, we find that VLMs fail on 7 visual tasks that are absurdly easy for humans, such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting the number of circles in an Olympic-like logo. The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry, and at worst, like that of an intelligent person who is blind making educated guesses. Code is available at: https://vlmsareblind.github.io/
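For illustration, a stimulus like the two-circle task above can be generated in a few lines; the sketch below is not the paper's actual generation script (radius, spacing, and resolution are arbitrary choices here) but shows the shape of one image-and-label pair that a VLM would be asked to recover.

```python
import random

import matplotlib.pyplot as plt


def make_two_circle_image(path: str = "circles.png", radius: float = 1.0) -> bool:
    """Render two equal circles and return the ground-truth overlap label."""
    # Illustrative distance range; equal-radius circles overlap iff distance < 2 * radius.
    distance = random.uniform(0.5 * radius, 4.0 * radius)
    overlap = distance < 2.0 * radius

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0.0, 0.0), radius, fill=False, linewidth=2))
    ax.add_patch(plt.Circle((distance, 0.0), radius, fill=False, linewidth=2))
    ax.set_xlim(-2.0 * radius, 6.0 * radius)
    ax.set_ylim(-4.0 * radius, 4.0 * radius)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return overlap


# Example: generate one stimulus and print the label the model should predict.
print("Circles overlap:", make_two_circle_image())
```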
Community
This paper examines the limitations of current vision-based language models, such as GPT-4 and Sonnet 3.5, in performing low-level vision tasks. Despite their high scores on numerous multimodal benchmarks, these models often fail on very basic cases. This raises a crucial question: are we evaluating these models accurately?
Wonderful! I'm so glad to see these flaws being pointed out in a paper! Thank you for your work on this!
I honestly find such clickbait titles on papers quite cringe, in particular with models like Claude 3.5 Sonnet, which performs much better than random on almost all the tests.
Also, the comparison between AI vision and myopia makes no sense, as these images are not evaluating eyesight but the abstract capabilities of the models.
Thank you for your feedback!
I agree that blindness and myopia are the terms defined for human vision. How AIs "see" the image is different from how humans see.
FYI. Dhruv Batra recently also called VLMs "nearly blind".
https://x.com/DhruvBatraDB/status/1778447178262040850
Perhaps we all should use "AI blindness" to avoid misinterpretation.
interesting read;
could this account, in part, for the substandard lip-reading capability of LLMs when applied to speech recognition...?
What's the state of the art for accurate lip-reading LLM applications? Anyone?
Regarding C1, it would be nice if more than 150 samples had been used, since they are pretty easy to generate anyway: something like 1k images per model to better filter out statistical noise. Otherwise, a difference of 1-3 points can just as well be random fluctuation (specifically looking at gemini-1.5 vs sonnet-3 in C2).
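To make the statistical point concrete, here is a small back-of-the-envelope sketch (an illustration, not a calculation from the paper; the 75% accuracy is a made-up value) of how the standard error of an accuracy estimate shrinks from 150 to 1,000 samples:

```python
import math


def accuracy_std_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent yes/no trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Hypothetical accuracy of 75% (not a number taken from the paper).
for n in (150, 1000):
    se = accuracy_std_error(0.75, n)
    # A ~95% confidence interval is roughly +/- 1.96 standard errors.
    print(f"n={n:>4}: SE ~ {100 * se:.1f} pts, 95% CI ~ +/-{100 * 1.96 * se:.1f} pts")
```

At 150 samples the 95% interval is roughly ±7 points, so a 1-3 point gap sits well inside the noise band; at 1,000 samples it shrinks to roughly ±3 points.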
Thank you for your feedback!
We're re-generating the images (larger sample size and less ambiguity) for the "counting two-line intersections" task in light of your suggestions and will update the paper.
Wow, so CAPTCHAs still have some chance against AIs... This is good.
Overlapping circles everyone! :)
This is very interesting. I recently wrote a short post on my thoughts about comparing VLMs with how the brain works. My conclusion was that, since these models lack the recurrence and bidirectionality that the brain has, the processing of VLMs is similar to the feedforward, pre-attentive processing that has also been studied in the neuroscience and psychology literature. Another way to think of it is that VLMs currently have to produce encodings that work for any relevant task, while our brains can interact with the visual hierarchy to produce "encodings" that are relevant for the task at hand. https://medium.com/towards-data-science/clip-llava-and-the-brain-2073dfb33d7e
I've tried some images from the vision sciences on VLMs and found that they do struggle on some of the images, but not as reliably as your results suggest.
Thank you for your article! Interesting!
Your feedback/recurrence point is indeed very similar to my hypothesis here:
https://x.com/anh_ng8/status/1813311161754144905
IMO, a high-level problem is that the granularity of the extracted visual representations, or the model's visual attention, should depend on the prompt. Yet most open-source models first extract visual representations without using the prompt and then fuse them with the text tokens.
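To make that contrast concrete, here is a schematic sketch (placeholder modules, not any particular model's implementation) of the prompt-agnostic pipeline described above versus a hypothetical prompt-conditioned one in which the text query shapes which visual features get extracted:

```python
import torch
import torch.nn as nn


class PromptAgnosticVLM(nn.Module):
    """Common open-source recipe: the vision encoder never sees the prompt;
    its output is projected and simply prepended to the text tokens."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

    def forward(self, image: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Visual tokens are computed with no knowledge of the question being asked.
        visual_tokens = self.projector(self.vision_encoder(image))
        return self.llm(torch.cat([visual_tokens, prompt_embeds], dim=1))


class PromptConditionedVLM(nn.Module):
    """Hypothetical alternative: the prompt cross-attends into the visual features,
    so the granularity of what gets extracted depends on the question."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        visual_features = self.vision_encoder(image)
        # Text queries attend over visual features: "look where the prompt asks".
        visual_tokens, _ = self.cross_attn(prompt_embeds, visual_features, visual_features)
        return self.llm(torch.cat([visual_tokens, prompt_embeds], dim=1))
```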
Very interesting. This article is a sobering counterpoint to the fervent arms race in multimodal model development. What kind of multimodal model do we truly need?
Recently, we also published a paper (Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model) addressing similar issues. We proposed the concept of abstract images: while current multimodal models perform well on conventional semantically related images, their understanding of abstract images, such as clocks, maps, layouts, and flowcharts, remains very rudimentary. Therefore, we constructed a large abstract image benchmark through self-instruct and code, and evaluated the current multimodal models. Our results are similar to those in this article, showing that even the most advanced multimodal models fail at some very simple tasks.
https://arxiv.org/abs/2407.07053
Code: https://github.com/zwq2018/Multi-modal-Self-instruct
Our Leaderboard: https://multi-modal-self-instruct.github.io/
Our Dataset: https://huggingface.co/datasets/zwq2018/Multi-modal-Self-instruct
Thank you for sharing!
This is very interesting and relevant work! :)
Interesting paper. VLMs have a long way to go and I'm glad there's on-the-record research documenting this now.
Unfortunately, it is TOTALLY unnecessary to use real-life human disabilities to critique a VLM negatively - it doesn't add any descriptive power and reinforces ableist stereotypes.
Thank you for your feedback!
We had simply thought human vision <> computer vision, and human blindness <> computer blindness.
Regardless, we've now removed a lot of mentions of "myopia" and "blindness" in the paper.