Ruprecht-Karls-Universität Heidelberg

Statistical Natural Language Processing Group

Resources & Corpora

  • BoostCLIR: A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance judgements for 1.4M documents.

  • DeCOCO: German translations for 1000 image captions from the COCO dataset.

  • NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from

  • NLmaps: A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Readable Language (MRL) formulae, using the OpenStreetMap database.

  • PatTR: A parallel patent corpus for statistical machine translation featuring three language pairs: German-English (23M sentence pairs), English-French (19M sentence pairs) and French-German (5M sentence pairs).

  • WikiCLIR: A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.
