Towards Multilingual Vision-and-Language Models

Speaker: Emanuele Bugliarello (University of Copenhagen)


The last few years have seen explosive growth in vision-and-language architectures. While these models have reached impressive performance on several tasks, they are usually trained on English captions paired with images from North America or Western Europe. In this talk, I will discuss the limitations of state-of-the-art vision-and-language models when evaluated on multilingual and multicultural data. First, I will introduce a new protocol for collecting culturally relevant images and captions, which resulted in MaRVL, a vision-and-language reasoning dataset in five typologically diverse languages. Then, I will present IGLUE, the first benchmark that evaluates multilingual multimodal models for transfer learning across languages, modalities, and tasks. By both aggregating pre-existing datasets and creating new ones, IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Unlike text-only language models, current multimodal encoders struggle in both zero-shot and few-shot cross-lingual transfer setups.

