NEW PUBLICATION

March 3rd, 2025

In "Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?" we investigate the reliance of modern Vision & Language Models (VLMs) on image๐Ÿ–ผ๏ธ vs. text๐Ÿ“„ inputs when generating answers vs. explanations, revealing fascinating insights into their modality use and self-consistency.

The key insights are:

(1) 🔎 We measure how much VLMs use text and images when generating predictions or explanations.
🎯 We find that VLMs are heavily text-centric when producing answers and natural language explanations.

(2) 🔎 We evaluate VLMs' self-consistency when generating post-hoc and chain-of-thought (CoT) explanations.
🎯 Most VLMs are less self-consistent than LLMs. For all models, the contributions of the image are significantly stronger when generating explanations than when generating answers.

(3) 🔎 We provide updated accuracy results for state-of-the-art VLMs on the VALSE 💃 benchmark.
🎯 Even modern VLMs still struggle with most phenomena tested by VALSE 💃, although models such as mPLUG-Owl3 show strong improvements.

Congratulations to the authors Letitia Parcalabescu and Anette Frank.