WikiCaps: A Multilingual Dataset of User-generated Captions

WikiCaps is a large-scale, multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the large number of non-descriptive image-caption pairs available on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions, plus about 1,000 image-caption pairs each for development and testing in German, French, and Russian, together with their English counterparts.

Terms of Use

The textual part of WikiCaps is licensed under a Creative Commons BY-SA 4.0 Unported License.

The visual part of WikiCaps is subject to the different licenses chosen by the original authors. We therefore include a script for downloading the images directly from Wikimedia Commons.

If you use the corpus in your work, please cite: (Schamoni et al., 2018)

Data

The corpus consists of two parts: English image-caption pairs for retrieval, and development (dev) and test sets of image-caption pairs whose captions are given in German, French, or Russian together with their English counterparts.

The image-caption data was retrieved from Wikimedia Commons. For space and processing efficiency, images were resized to a minimum of 256 pixels (width or height) preserving the original aspect ratio.
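The resizing arithmetic can be illustrated with a short sketch. It assumes that "a minimum of 256 pixels" means the shorter side is scaled to 256 pixels while the longer side follows the aspect ratio; the function name `resized_dimensions` is hypothetical and is not part of the corpus tools, and the sketch does not claim to reproduce how the original images were actually processed.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int, target: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side equal to `target`.

    Assumption: the shorter side is scaled to `target` pixels and the
    longer side is scaled by the same factor to keep the aspect ratio.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / min(width, height)
    # max() guards against rounding the shorter side below the target.
    new_width = max(target, round(width * scale))
    new_height = max(target, round(height * scale))
    return new_width, new_height


if __name__ == "__main__":
    print(resized_dimensions(1024, 768))
    print(resized_dimensions(500, 1000))
```

Under this reading, a 1024x768 image would become 341x256, and a 500x1000 image would become 256x512.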

For a more detailed description of the corpus construction process, see the publication cited above and the README in the download archive.

part	#images	#captions	language(s)
retrieval	3,816,940	3,825,132	English
dev	1,000	1,000	German–English
test	999	999	German–English
dev	999	999	French–English
test	1,000	1,000	French–English
dev	1,000	1,000	Russian–English
test	1,000	1,000	Russian–English

Table 1. Statistics

Format

There are three types of data files:

Monolingual retrieval data (img_en)
Bilingual dev and test data (.dev or .test file)
Image list (.lst file)

The format of the img_en file for retrieval is:

image-filename [TAB] English-caption
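As an illustration of this format, the following sketch reads such lines into (filename, caption) pairs. It assumes the file is UTF-8 encoded and that the first tab on each line separates the filename from the caption; the function names and the sample filename are hypothetical and are not taken from the download archive.

```python
from typing import Iterator, Tuple


def parse_retrieval_line(line: str) -> Tuple[str, str]:
    """Split one img_en line into (image_filename, english_caption)."""
    # Split on the first tab only, so a later tab would stay in the caption.
    filename, caption = line.rstrip("\n").split("\t", 1)
    return filename.strip(), caption.strip()


def read_retrieval_file(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (image_filename, english_caption) pairs, skipping blank lines."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_retrieval_line(line)


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    print(parse_retrieval_line("Example.jpg\tA caption in English\n"))
```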

The format of the bilingual .dev and .test files is:

image-filename [TAB] Foreign-caption ||| English-caption
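A line in this format could be parsed as sketched below. The sketch assumes that the first tab separates the filename from the captions and that the first `|||` separates the foreign caption from the English one; the `BilingualPair` type, the function name, and the sample line are hypothetical.

```python
from typing import NamedTuple


class BilingualPair(NamedTuple):
    image: str
    foreign: str
    english: str


def parse_bilingual_line(line: str) -> BilingualPair:
    """Split one .dev/.test line into image filename and both captions."""
    image, captions = line.rstrip("\n").split("\t", 1)
    # partition() splits at the first "|||"; sep is empty if it is missing.
    foreign, sep, english = captions.partition("|||")
    if not sep:
        raise ValueError("missing '|||' separator in line: %r" % line)
    return BilingualPair(image.strip(), foreign.strip(), english.strip())


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    pair = parse_bilingual_line("Beispiel.jpg\tEin Haus am See ||| A house by the lake\n")
    print(pair.image, pair.foreign, pair.english, sep=" / ")
```

Raising an error on a missing separator, rather than returning an empty English caption, makes malformed lines visible early.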

The image lists (.lst files) contain one image filename per line and serve as input for wikimgrab.pl (see download archive):

image-filename

Download

wikicaps_v1.0.tar.gz (v1.0, 02/13/2018, 427MB, md5: 47a3aa5cf64f70aced556f1751faedba)
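After downloading, the archive can be checked against the MD5 sum given above. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the archive is assumed to sit in the current working directory.

```python
import hashlib

# MD5 sum as stated in the download line above.
EXPECTED_MD5 = "47a3aa5cf64f70aced556f1751faedba"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the archive was saved under its original name.
    ok = verify_archive("wikicaps_v1.0.tar.gz")
    print("checksum OK" if ok else "checksum mismatch")
```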

Publication

Shigehiko Schamoni, Julian Hitschler and Stefan Riezler

A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions

Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas (AMTA), Boston, MA, USA, 2018


@inproceedings{schamoni2018,
  author = {Schamoni, Shigehiko and Hitschler, Julian and Riezler, Stefan},
  title = {A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions},
  booktitle = {Proceedings of the 13th biennial conference of the Association for Machine Translation in the Americas (AMTA)},
  year = {2018},
  address = {Boston, MA, USA},
  url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/AMTA2018.1.pdf}
}