Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

Integrating Vision and Language: Achievements and Challenges in Multimodal Machine Learning

Module Description

Course               Module Abbreviation   Credit Points
BA-2010              AS-FL                 8 LP
BA-2010              AS-CL                 8 LP
BA-2010[100%|75%]    CS-CL                 6 LP
BA-2010[50%]         BS-CL                 6 LP
BA-2010[25%]         BS-AC, BS-FL          4 LP

Lecturer                   Letitia Parcalabescu
Module Type                Proseminar / Hauptseminar
Language                   English
First Session              23.10.2019
Time and Place             Wednesday, 16:15-17:45, INF 326 / SR 27, 2nd floor
End of Commitment Period   21.01.2020

Prerequisites for Participation

  • good knowledge of statistical methods, incl. neural networks
  • advanced BA students or MA students
  • basic understanding of computer vision
  • interest in the interdisciplinary field of NLP and Computer Vision


Assessment

  • regular, active participation
  • presentation
  • project, seminar paper, or equivalent contributions to the seminar


Progress in artificial intelligence requires more than understanding text in isolation while processing other signals, such as images or sound, separately. Multimodal machine learning aims to handle combinations of different signal types and to relate information across modalities. In this seminar, we will study the latest machine learning techniques addressing the multimodal applications and datasets that have emerged in recent years. We will discuss the performance of state-of-the-art models and assess the shortcomings and challenges of current research. Topics include:

  • Visual Question Answering (VQA)
  • Visual Dialogue
  • Phrase Grounding
  • Visual-Textual Entailment
  • Scene Graph Generation
  • Multimodal Machine Translation

Module Overview


For the agenda and the respective materials, please check the protected Materials Webpage.


Literature will be provided by the beginning of the term. Surveys and overviews:

  • Baltrušaitis, T., Ahuja, C. and Morency, L.P., 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), pp.423-443.
  • Kafle, K., Shrestha, R. and Kanan, C., 2019. Challenges and Prospects in Vision and Language Research. arXiv preprint arXiv:1904.09317.
  • Schlangen, D., 2019. Natural Language Semantics With Pictures: Some Language & Vision Datasets and Potential Uses for Computational Semantics. arXiv preprint arXiv:1904.07318.

