WikiCaps: A Multilingual Dataset of User-generated Captions
WikiCaps is a large-scale multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the large pool of non-descriptive image-caption pairs available on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions, plus an additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.
The textual part of WikiCaps is licensed under a Creative Commons BY-SA 4.0 Unported License.
The visual part of WikiCaps is covered by various licenses chosen by the original authors. We therefore include a script for downloading the images directly from Wikimedia Commons.
If you use the corpus in your work, please cite:
Shigehiko Schamoni, Julian Hitschler, Stefan Riezler. "A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions". In Proceedings of the Association for Machine Translation in the Americas (AMTA 2018), Boston, MA, USA.
The corpus contains image-caption pairs for the English retrieval part, and image-caption pairs for dev and test, with parallel captions in German, French, and Russian and their English counterparts.
The image-caption data was retrieved from Wikimedia Commons. For space and processing efficiency, images were resized to a minimum of 256 pixels (width or height) preserving the original aspect ratio.
For a more detailed description of the corpus construction process, see the above publication and consult the README in the download archive.
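The resizing step described above can be illustrated with a small Python sketch that computes target dimensions. It assumes that "a minimum of 256 pixels" means the shorter side is scaled to 256 pixels; the function name `resized_dimensions` and the `min_side` parameter are hypothetical, and the sketch does not reproduce the tooling actually used to build the corpus. Whether images smaller than 256 pixels were upscaled is not stated, so this sketch simply scales in both directions.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int, min_side: int = 256) -> Tuple[int, int]:
    """Scale (width, height) so that the shorter side equals min_side.

    The aspect ratio is preserved up to rounding to whole pixels.
    Assumption: the shorter side is the one fixed to min_side.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min_side / min(width, height)
    # Round to whole pixels and never return a zero-sized dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A 1024x768 landscape image: the height is the shorter side.
    print(resized_dimensions(1024, 768))
```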
Format
There are three types of data files:
- Monolingual retrieval data (img_en)
- Bilingual dev and test data (.dev or .test file)
- Images list (.lst file)
The format of the monolingual retrieval data is:
image-filename [TAB] English-caption
The format of the bilingual .dev and .test files is:
image-filename [TAB] Foreign-caption ||| English-caption
Each image list (.lst) contains one image filename per line and serves as input for wikimgrab.pl (see download archive).
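As a rough illustration of the file formats described above, the following Python sketch parses a monolingual line, a bilingual line, and an image list. All names (`Caption`, `parse_monolingual`, `parse_bilingual`, `read_image_list`) are hypothetical and not part of the WikiCaps distribution. The sketch assumes that the first tab on a line ends the filename and that `|||` separates the foreign caption from the English one.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class Caption(NamedTuple):
    image: str
    english: str
    foreign: Optional[str] = None  # None for monolingual lines


def _split_filename(line: str) -> Tuple[str, str]:
    """Split a line at the first tab into (filename, remainder)."""
    image, sep, rest = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError("expected a tab after the image filename")
    return image, rest


def parse_monolingual(line: str) -> Caption:
    """Parse 'image-filename [TAB] English-caption'."""
    image, english = _split_filename(line)
    return Caption(image, english.strip())


def parse_bilingual(line: str) -> Caption:
    """Parse 'image-filename [TAB] Foreign-caption ||| English-caption'."""
    image, rest = _split_filename(line)
    foreign, sep, english = rest.partition("|||")
    if not sep:
        raise ValueError("expected '|||' between the two captions")
    return Caption(image, english.strip(), foreign.strip())


def read_image_list(lines: Iterable[str]) -> Iterator[str]:
    """Yield one image filename per non-empty line of a .lst file."""
    for line in lines:
        name = line.strip()
        if name:
            yield name
```

To apply these functions to a real file, open it with `open(path, encoding="utf-8")` and pass each line to the matching parser; the UTF-8 encoding is itself an assumption that should be checked against the README in the download archive.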