Abstract of the Talk
In this talk, we will introduce Vision and Language (VL) models, which are very good at judging whether an image and a text are related and at answering questions about images. While performance on these tasks is important, task-centered evaluation does not tell us why the models are so good at them, for example which fine-grained linguistic capabilities VL models use when solving them. Therefore, we present our work on the VALSE💃 benchmark, which tests six specific linguistic phenomena grounded in images. Our zero-shot experiments with five widely used pretrained VL models suggest that current VL models have considerable difficulty addressing most phenomena.
In the second part, we ask how much a VL model relies on the image and text modalities in each sample or dataset. To measure the contribution of each modality in a VL model, we developed MM-SHAP, which we applied in two ways: (1) to compare VL models for their average degree of multimodality, and (2) to measure, for individual models, the contribution of each modality across different tasks and datasets. Experiments with six VL models on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided.
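To make the idea of a per-modality contribution concrete, here is a minimal sketch. It assumes that each input token (text token or image patch) already has a signed contribution score, for instance a Shapley value, and that a modality's share is its summed absolute scores divided by the total over both modalities. The abstract does not spell out the computation, so the function name `modality_shares`, its arguments, and the example numbers are hypothetical and for illustration only.

```python
def modality_shares(token_contributions, is_text):
    """Return (text_share, image_share) from per-token contribution scores.

    Assumption: a modality's share is the sum of absolute contributions of
    its tokens, normalised by the sum over all tokens of both modalities.
    """
    if len(token_contributions) != len(is_text):
        raise ValueError("need exactly one modality flag per contribution")

    # Absolute values, so positive and negative contributions do not cancel.
    text_total = sum(abs(c) for c, t in zip(token_contributions, is_text) if t)
    image_total = sum(abs(c) for c, t in zip(token_contributions, is_text) if not t)

    total = text_total + image_total
    if total == 0:
        raise ValueError("all contributions are zero; shares are undefined")
    return text_total / total, image_total / total


# Made-up scores: three text tokens followed by two image patches.
scores = [0.5, -0.25, 0.75, 0.25, -0.25]
is_text = [True, True, True, False, False]
text_share, image_share = modality_shares(scores, is_text)
print(f"text share: {text_share:.2f}, image share: {image_share:.2f}")
```

Under this assumption, a text share far above 0.5 would indicate that the prediction leans mostly on the text, and a share far below 0.5 that it leans mostly on the image, which matches the point that collapse can go in either direction.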
Hope to see you there on November 28th 2023 @ 4pm!