Statistical Natural Language Processing Group

Resources & Corpora

BoostCLIR: A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance judgements for 1.4M documents.

DeCOCO: German translations for 1000 image captions from the COCO dataset.

HumanMT: Human pairwise and five-point ratings for 1000 translations from German to English.

NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.

NLmaps: A corpus for question answering, consisting of 2,380 questions in English and German with corresponding Machine Readable Language (MRL) formulae, using the OpenStreetMap database.

PatTR: A parallel patent corpus for statistical machine translation featuring three language pairs: German-English (23M sentence pairs), English-French (19M sentence pairs), and French-German (5M sentence pairs).

WikiCaps: A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimedia Commons.

WikiCLIR: A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.
