Statistical Natural Language Processing Group
Resources & Corpora
- BoostCLIR: A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance judgements for 1.4M documents.
- DeCOCO: German translations for 1000 image captions from the COCO dataset.
- HumanMT: Human pairwise and five-point ratings for 1000 translations from German to English.
- NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.
- NLmaps: A corpus for question answering over the OpenStreetMap database, consisting of 2,380 questions in English and German paired with corresponding Machine Readable Language (MRL) formulae.
- PatTR: A parallel patent corpus for statistical machine translation featuring three language pairs: German-English (23M sentence pairs), English-French (19M sentence pairs), and French-German (5M sentence pairs).
- WikiCaps: A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimedia Commons.
- WikiCLIR: A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.