Using images to ground machine translation

Multi-modal Machine Translation (MMT) is a relatively new research topic, only recently addressed by the Machine Translation (MT) research community in the form of a shared task. The practical goal of MMT is to build MT models that use image information to better translate image descriptions, i.e. to improve the translation of ambiguous terms that could in principle be disambiguated by an image (e.g., an image of a jaguar could help determine whether a mention of "jaguar" refers to the car brand or to the animal species). There are many conceivable ways to extract visual information from images, as well as different MT architectures into which one can incorporate that visual information. In this talk, I will discuss how to incorporate both global and local image features, obtained with publicly available pre-trained Convolutional Neural Networks, into the attention-based Neural Machine Translation architecture.
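As a rough illustration of the kind of integration the talk concerns (not the speaker's exact model), the sketch below shows one common way to use a *global* image feature: the pooled activation of a pre-trained CNN is projected into the decoder's hidden space and used to initialize an attention-based NMT decoder. Local features (e.g., a spatial grid of convolutional activations) could analogously serve as an extra attention memory; that path, and the source-side attention itself, are omitted here for brevity. All dimensions and names are illustrative assumptions.

```python
# Minimal sketch, assuming a 2048-dim global CNN feature (e.g., a pooled
# ResNet activation) and a GRU decoder; this is NOT the talk's specific model.
import torch
import torch.nn as nn


class ImageInitDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Project the global image feature into the decoder's hidden space.
        self.img_to_h0 = nn.Linear(img_dim, hid_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, trg_tokens, global_img_feat):
        # trg_tokens: (batch, trg_len); global_img_feat: (batch, img_dim)
        h0 = torch.tanh(self.img_to_h0(global_img_feat)).unsqueeze(0)
        emb = self.embed(trg_tokens)
        rnn_out, _ = self.rnn(emb, h0)   # decoding is conditioned on the image
        return self.out(rnn_out)         # per-step target-vocabulary logits


# Toy usage with random inputs.
decoder = ImageInitDecoder(vocab_size=1000)
logits = decoder(torch.randint(0, 1000, (2, 7)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 7, 1000])
```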