WikiCaps: A Multilingual Dataset of User-generated Captions

WikiCaps is a large-scale multilingual, non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the widely available, non-descriptive image-caption pairs found on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions, plus dev and test sets of about 1,000 image-caption pairs each in German, French, and Russian, together with their English counterparts.

Terms of Use

The textual part of WikiCaps is licensed under a Creative Commons BY-SA 4.0 Unported License.

The visual part of WikiCaps is protected under different licenses by the original authors. We therefore include a script for downloading the images directly from Wikimedia Commons.

If you use the corpus in your work, please cite: (Schamoni, Hitschler, & Riezler, 2018)

Data

The corpus consists of an English retrieval part with monolingual image-caption pairs, and dev and test sets in which German, French, or Russian captions are paired with their English counterparts.

The image-caption data was retrieved from Wikimedia Commons. For space and processing efficiency, images were resized to a minimum of 256 pixels (width or height) preserving the original aspect ratio.
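The resizing rule can be read in more than one way. The sketch below assumes that the shorter side (width or height) is scaled to 256 pixels and the other side follows from the aspect ratio; this reading, the rounding, and the function name `scaled_size` are assumptions, not taken from the corpus tools.

```python
from typing import Tuple


def scaled_size(width: int, height: int, short_side: int = 256) -> Tuple[int, int]:
    """Return (width, height) scaled so that the shorter side equals short_side.

    Assumption: "a minimum of 256 pixels" means the shorter side becomes 256.
    The aspect ratio is preserved up to rounding to whole pixels.
    """
    scale = short_side / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: a 1024x512 landscape image would become 512x256 under this reading.
print(scaled_size(1024, 512))
```

How images already smaller than 256 pixels were handled is not stated here, so the sketch makes no special case for them.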

For a more detailed description of the corpus construction process, see the above publication and consult the README in the download archive.

split      #images    #captions  language(s)
retrieval  3,816,940  3,825,132  English
dev        1,000      1,000      German–English
test       999        999        German–English
dev        999        999        French–English
test       1,000      1,000      French–English
dev        1,000      1,000      Russian–English
test       1,000      1,000      Russian–English

Format

There are three types of data files:

  1. Monolingual retrieval data (img_en)
  2. Bilingual dev and test data (.dev or .test file)
  3. Image list (.lst file)

The format of the img_en file for retrieval is:

image-filename [TAB] English-caption
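
A minimal sketch of reading this format in Python. It assumes that [TAB] stands for a single literal tab character and that the file is UTF-8 encoded; neither is stated here, so check the README in the archive. The function name `read_retrieval_pairs` is illustrative.

```python
from typing import Iterable, Iterator, Tuple


def read_retrieval_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image_filename, english_caption) pairs from img_en-style lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so any tab inside a caption is kept.
        # A line without a tab raises ValueError, which flags malformed input.
        filename, caption = line.split("\t", 1)
        yield filename, caption


# Usage (the path is an assumption; adjust to where the archive was unpacked):
# with open("img_en", encoding="utf-8") as f:
#     for filename, caption in read_retrieval_pairs(f):
#         ...
```

Since the retrieval part has more captions than images, some filenames must occur on more than one line. Collecting captions in a list per filename avoids silently dropping them.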

The format of the bilingual .dev and .test files is:

image-filename [TAB] Foreign-caption ||| English-caption
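
A line in this format could be split into its three fields roughly as follows. The sketch assumes a literal tab after the filename and treats the first `|||` as the separator, stripping the spaces around it; `BilingualEntry` and `parse_bilingual_line` are hypothetical names, not part of the corpus tools.

```python
from typing import NamedTuple


class BilingualEntry(NamedTuple):
    image: str    # image filename
    foreign: str  # German, French, or Russian caption
    english: str  # English counterpart


def parse_bilingual_line(line: str) -> BilingualEntry:
    """Parse 'image-filename<TAB>foreign ||| english' into its three fields."""
    image, captions = line.rstrip("\r\n").split("\t", 1)
    # Assumption: the first '|||' separates the two captions.
    foreign, english = captions.split("|||", 1)
    return BilingualEntry(image, foreign.strip(), english.strip())


# Example with an invented line (not taken from the corpus):
entry = parse_bilingual_line("example.jpg\tEine Katze ||| A cat\n")
print(entry.image, entry.foreign, entry.english)
```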

The image list files (.lst) contain one image filename per line and serve as input for wikimgrab.pl (see download archive):

image-filename
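
After downloading, a list file can also be used to check which images are still missing locally. The sketch below assumes that the images are stored in a single directory under exactly the filenames given in the list; how wikimgrab.pl actually names and places its output is not described here, so treat this as an assumption. The helper `missing_images` is hypothetical.

```python
import os
from typing import Iterable, List


def missing_images(list_lines: Iterable[str], image_dir: str) -> List[str]:
    """Return the filenames from a .lst file that are not present in image_dir."""
    missing = []
    for line in list_lines:
        name = line.strip()
        # Ignore blank lines; report every listed file that is absent on disk.
        if name and not os.path.isfile(os.path.join(image_dir, name)):
            missing.append(name)
    return missing


# Usage (both paths are assumptions):
# with open("images.lst", encoding="utf-8") as f:
#     print(missing_images(f, "images/"))
```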

Download

wikicaps_v1.0.tar.gz (v1.0, 02/13/2018, 427MB, md5: 47a3aa5cf64f70aced556f1751faedba)
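
The md5 checksum above can be used to verify the downloaded archive. A minimal sketch using Python's standard `hashlib` module; the local path in the usage comment is an assumption.

```python
import hashlib

# Checksum as published on the download line for wikicaps_v1.0.tar.gz.
EXPECTED_MD5 = "47a3aa5cf64f70aced556f1751faedba"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the archive was saved in the current directory):
# if md5_of_file("wikicaps_v1.0.tar.gz") != EXPECTED_MD5:
#     raise SystemExit("checksum mismatch: download may be corrupted")
```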

Publication

  1. Shigehiko Schamoni, Julian Hitschler and Stefan Riezler
    A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions
    Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas (AMTA), Boston, MA, USA, 2018
    @inproceedings{schamoni2018,
      author = {Schamoni, Shigehiko and Hitschler, Julian and Riezler, Stefan},
      title = {A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions},
      booktitle = {Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas},
      journal-abbrev = {AMTA},
      year = {2018},
      city = {Boston, MA},
      country = {USA},
      url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/AMTA2018.1.pdf}
    }