blog thumbnail

How difficult is matching images and text? – An investigation

Siting Liang

Ordinarily, we call the channels of communication and sensation modalities. We experience the world involving multiple modalities, such as vision, hearing or touch. A system which handles dataset includes multiple modalities is characterized as a multi-modality modal. For example, MSCOCO dataset contains not only the images, but also the language captions for each image. Utilizing such data, a model can learn to bridge the language and vision modalities.