Ruprecht-Karls-Universität Heidelberg

WikiCaps: A Multilingual Dataset of User-generated Captions

WikiCaps is a large-scale multilingual, but non-parallel, dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is therefore representative of the non-descriptive image-caption pairs that are widely available on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and an additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.

Terms of Use

The textual part of WikiCaps is licensed under a Creative Commons BY-SA 4.0 Unported License.

The visual part of WikiCaps is protected under different licenses by the original authors. We thus included a script for downloading the images directly from Wikimedia Commons.

If you use the corpus in your work, please cite:

Shigehiko Schamoni, Julian Hitschler, Stefan Riezler. "A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions". In Proceedings of the Association for Machine Translation in the Americas (AMTA 2018), Boston, MA, USA.

Data

The corpus consists of two parts: a monolingual set of English image-caption pairs for retrieval, and bilingual dev and test sets of image-caption pairs whose captions are given in German, French, or Russian together with their English counterparts.

The image-caption data was retrieved from Wikimedia Commons. For space and processing efficiency, images were resized to a minimum of 256 pixels (width or height) preserving the original aspect ratio.
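The resizing rule can be illustrated with a small arithmetic sketch. It assumes one possible reading of the sentence above, namely that the shorter side of each image is scaled to 256 pixels while the longer side follows the aspect ratio. The function name `target_size` is hypothetical and is not part of the WikiCaps tooling; the actual procedure is described in the README in the download archive.

```python
from typing import Tuple


def target_size(width: int, height: int, min_side: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side equal to min_side.

    Assumption: "resized to a minimum of 256 pixels" means the shorter side
    becomes 256 pixels. This is an interpretation, not the authors' code.
    """
    shorter = min(width, height)
    # Multiply before dividing so that exact ratios stay exact.
    new_width = round(width * min_side / shorter)
    new_height = round(height * min_side / shorter)
    return new_width, new_height


if __name__ == "__main__":
    print(target_size(1024, 768))  # (341, 256)
    print(target_size(300, 600))   # (256, 512)
```

Under this reading, a 1024x768 image would become 341x256 and a 300x600 image would become 256x512.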

For a more detailed description of the corpus construction process, see the above publication and consult the README in the download archive.


Statistics

split      #images    #captions  language(s)
retrieval  3,816,940  3,825,132  English
dev        1,000      1,000      German–English
test       999        999        German–English
dev        999        999        French–English
test       1,000      1,000      French–English
dev        1,000      1,000      Russian–English
test       1,000      1,000      Russian–English

Format

There are three types of data files:
  1. Monolingual retrieval data (img_en)
  2. Bilingual dev and test data (.dev or .test file)
  3. Image list (.lst file)
The format of the img_en file for retrieval is:
image-filename [TAB] English-caption
The format of the bilingual .dev and .test files is:
image-filename [TAB] Foreign-caption ||| English-caption
The image list files (.lst) contain one image filename per line and serve as input for wikimgrab.pl (see download archive):
image-filename
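
The three line formats above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the corpus: the names `RetrievalEntry`, `BilingualEntry`, `parse_retrieval_line`, `parse_bilingual_line`, and `parse_image_list` are hypothetical, and the sketch assumes that `[TAB]` denotes a single tab character and that the separator `|||` does not occur inside a caption.

```python
from typing import List, NamedTuple


class RetrievalEntry(NamedTuple):
    image: str
    caption_en: str


class BilingualEntry(NamedTuple):
    image: str
    caption_foreign: str
    caption_en: str


def parse_retrieval_line(line: str) -> RetrievalEntry:
    """Parse one img_en line: image-filename [TAB] English-caption."""
    image, caption = line.rstrip("\n").split("\t", 1)
    return RetrievalEntry(image, caption)


def parse_bilingual_line(line: str) -> BilingualEntry:
    """Parse one .dev/.test line: image-filename [TAB] Foreign ||| English."""
    image, captions = line.rstrip("\n").split("\t", 1)
    # Assumption: the first "|||" separates the foreign and English captions.
    foreign, english = captions.split("|||", 1)
    return BilingualEntry(image, foreign.strip(), english.strip())


def parse_image_list(text: str) -> List[str]:
    """Parse the contents of a .lst file: one image filename per line."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


if __name__ == "__main__":
    # The sample lines are made up for illustration and are not corpus data.
    print(parse_retrieval_line("Example.jpg\tA church in Heidelberg\n"))
    print(parse_bilingual_line("Example.jpg\tEine Kirche ||| A church\n"))
    print(parse_image_list("a.jpg\nb.png\n"))
```

Splitting on the first tab only keeps any further tab characters inside the caption text instead of raising an error.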

Downloads

wikicaps_v1.0.tar.gz (v1.0, 02/13/2018, 427MB, md5)
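
After downloading, the archive can be checked against the published md5 checksum. The sketch below uses Python's standard `hashlib` module to compute the digest of a local file; the helper name `md5_of_file` is hypothetical, and the expected checksum value is not reproduced here and has to be taken from the md5 file offered with the download.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal md5 digest of the file at `path`.

    The file is read in chunks so that a large archive does not have to
    fit into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python md5check.py wikicaps_v1.0.tar.gz
    if len(sys.argv) > 1:
        print(md5_of_file(sys.argv[1]))
```

The printed digest should match the published value exactly; a mismatch usually indicates an incomplete or corrupted download.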
