In "Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?" we investigate the reliance of modern Vision & Language Models (VLMs) on image๐ผ๏ธ vs. text๐ inputs when generating answers vs. explanations, revealing fascinating insights into their modality use and self-consistency.
The key insights are:
(1) We measure how much VLMs use text and images when generating predictions or explanations. We find that VLMs are heavily text-centric when producing answers and natural language explanations (an illustrative sketch of measuring modality contributions follows this list).
(2) We evaluate VLMs' self-consistency when generating post-hoc and CoT explanations. Most VLMs are less self-consistent than LLMs. For all models, the contribution of the image is significantly stronger when generating explanations than when generating answers (a toy comparison is sketched after the list).
(3) We provide an update on the accuracies reached by state-of-the-art VLMs on the VALSE benchmark. Even modern VLMs still struggle with most phenomena tested by VALSE, although there are strong improvements from models such as mPLUG-Owl3.
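
To make insight (1) concrete, here is a minimal, illustrative sketch of one way to estimate modality contributions: mask random parts of each modality and compare the resulting drop in the model's answer log-likelihood. This ablation-style proxy is not the paper's actual measurement procedure, and the names `answer_logprob` and `modality_contributions` are hypothetical; the scoring hook would have to be implemented for the specific VLM at hand.

```python
"""Illustrative sketch (not the paper's exact procedure): estimate how much a
VLM relies on the image vs. the text by masking random parts of each modality
and measuring the drop in the model's answer log-likelihood."""

import random
from typing import Callable, List, Sequence


def answer_logprob(image_patches: Sequence, text_tokens: Sequence[str]) -> float:
    # Hypothetical hook: return the VLM's log-probability of its answer given
    # these (possibly masked) image patches and text tokens.
    raise NotImplementedError("plug in your VLM here")


def modality_contributions(
    image_patches: List,
    text_tokens: List[str],
    logprob_fn: Callable = answer_logprob,
    n_samples: int = 50,
    mask_frac: float = 0.3,
    seed: int = 0,
) -> dict:
    """Return each modality's share of the average log-likelihood drop."""
    rng = random.Random(seed)
    base = logprob_fn(image_patches, text_tokens)

    def avg_drop(items, score_without):
        drops = []
        for _ in range(n_samples):
            k = max(1, int(mask_frac * len(items)))
            masked = set(rng.sample(range(len(items)), k))
            drops.append(base - score_without(masked))
        return sum(drops) / len(drops)

    img_drop = avg_drop(
        image_patches,
        lambda idx: logprob_fn(
            [p for i, p in enumerate(image_patches) if i not in idx], text_tokens
        ),
    )
    txt_drop = avg_drop(
        text_tokens,
        lambda idx: logprob_fn(
            image_patches, [t for i, t in enumerate(text_tokens) if i not in idx]
        ),
    )
    total = (img_drop + txt_drop) or 1e-9
    return {"image_share": img_drop / total, "text_share": txt_drop / total}


if __name__ == "__main__":
    # Toy demo with a fake scorer that depends more on text than on the image.
    def toy_scorer(patches, tokens):
        return 0.2 * len(patches) + 0.8 * len(tokens)

    print(modality_contributions(list(range(16)), ["tok"] * 16, logprob_fn=toy_scorer))
    # -> text_share noticeably larger than image_share
```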
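
For insight (2), the comparison idea can be illustrated by computing such modality shares twice, once while the model generates its answer and once while it generates its explanation, and checking how closely they agree. The function below is only a toy proxy, not the paper's self-consistency measure.

```python
def consistency_score(answer_shares: dict, explanation_shares: dict) -> float:
    """Toy proxy: 1.0 = identical image/text use for answers and explanations,
    0.0 = maximally different. The paper's measure is more involved."""
    return 1.0 - abs(answer_shares["image_share"] - explanation_shares["image_share"])


# Example with made-up shares (e.g. produced by modality_contributions above):
print(consistency_score({"image_share": 0.20, "text_share": 0.80},
                        {"image_share": 0.35, "text_share": 0.65}))  # -> 0.85
```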
Congratulations to the authors, Letitia Parcalabescu and Anette Frank.