In "Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?" we investigate the reliance of modern Vision & Language Models (VLMs) on image๐ผ๏ธ vs. text๐ inputs when generating answers vs. explanations, revealing fascinating insights into their modality use and self-consistency.
The key insights are:
(1) We measure how much VLMs use text and images when generating predictions or explanations. We find that VLMs are heavily text-centric when producing answers and natural language explanations (an illustrative sketch of measuring modality contributions follows this list).
(2) We evaluate VLMs' self-consistency when generating post-hoc and CoT explanations. Most VLMs are less self-consistent than LLMs. For all models, the contribution of the image is significantly stronger when generating explanations than when generating answers (a toy comparison is sketched after the list).
(3) We provide an update on the accuracies reached by state-of-the-art VLMs on the VALSE benchmark. Even modern VLMs still struggle with most phenomena tested by VALSE, although there are strong improvements from models such as mPLUG-Owl3.
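
To make insight (1) concrete, here is a minimal, illustrative sketch of one way to estimate modality contributions: mask random parts of each modality and compare the resulting drop in the model's answer log-likelihood. This ablation-style proxy is not the paper's actual measurement procedure, and the names `answer_logprob` and `modality_contributions` are hypothetical; the scoring hook would have to be implemented for the specific VLM at hand.

```python
"""Illustrative sketch (not the paper's exact procedure): estimate how much a
VLM relies on the image vs. the text by masking random parts of each modality
and measuring the drop in the model's answer log-likelihood."""

import random
from typing import Callable, List, Sequence


def answer_logprob(image_patches: Sequence, text_tokens: Sequence[str]) -> float:
    # Hypothetical hook: return the VLM's log-probability of its answer given
    # these (possibly masked) image patches and text tokens.
    raise NotImplementedError("plug in your VLM here")


def modality_contributions(
    image_patches: List,
    text_tokens: List[str],
    logprob_fn: Callable = answer_logprob,
    n_samples: int = 50,
    mask_frac: float = 0.3,
    seed: int = 0,
) -> dict:
    """Return each modality's share of the average log-likelihood drop."""
    rng = random.Random(seed)
    base = logprob_fn(image_patches, text_tokens)

    def avg_drop(items, score_without):
        drops = []
        for _ in range(n_samples):
            k = max(1, int(mask_frac * len(items)))
            masked = set(rng.sample(range(len(items)), k))
            drops.append(base - score_without(masked))
        return sum(drops) / len(drops)

    img_drop = avg_drop(
        image_patches,
        lambda idx: logprob_fn(
            [p for i, p in enumerate(image_patches) if i not in idx], text_tokens
        ),
    )
    txt_drop = avg_drop(
        text_tokens,
        lambda idx: logprob_fn(
            image_patches, [t for i, t in enumerate(text_tokens) if i not in idx]
        ),
    )
    total = (img_drop + txt_drop) or 1e-9
    return {"image_share": img_drop / total, "text_share": txt_drop / total}


if __name__ == "__main__":
    # Toy demo with a fake scorer that depends more on text than on the image.
    def toy_scorer(patches, tokens):
        return 0.2 * len(patches) + 0.8 * len(tokens)

    print(modality_contributions(list(range(16)), ["tok"] * 16, logprob_fn=toy_scorer))
    # -> text_share noticeably larger than image_share
```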
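
For insight (2), the comparison idea can be illustrated by computing such modality shares twice, once while the model generates its answer and once while it generates its explanation, and checking how closely they agree. The function below is only a toy proxy, not the paper's self-consistency measure.

```python
def consistency_score(answer_shares: dict, explanation_shares: dict) -> float:
    """Toy proxy: 1.0 = identical image/text use for answers and explanations,
    0.0 = maximally different. The paper's measure is more involved."""
    return 1.0 - abs(answer_shares["image_share"] - explanation_shares["image_share"])


# Example with made-up shares (e.g. produced by modality_contributions above):
print(consistency_score({"image_share": 0.20, "text_share": 0.80},
                        {"image_share": 0.35, "text_share": 0.65}))  # -> 0.85
```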
Congratulations to the authors, Letitia Parcalabescu and Anette Frank.