Ruprecht-Karls-Universität Heidelberg

Institut für Computerlinguistik

Resources & Corpora and Software

Resources & Corpora

  • BoostCLIR: A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance judgements for 1.4M documents.
  • DeCOCO: German translations for 1000 image captions from the COCO dataset.
  • GER_SET: Situation Entity Type labelled corpus for German [LiMo project]
  • GNVN: Two datasets for joint semantic predicate (GermaNet) and semantic role (VerbNet-style) annotation for German.
  • IKAT: Implied knowledge annotations in argumentative texts (NLDB 2017) [LiMo project]
  • MSC: A dataset for modal sense classification.
  • MACE-AL: Detecting annotation noise in automatically labelled data (ACL 2017) [LiMo project]
  • NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from
  • NLmaps: A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Readable Language (MRL) formulae, using the OpenStreetMap database.
  • PatTR: A parallel patent corpus for statistical machine translation featuring three language pairs: German-English (23M sentence pairs), English-French (19M sentence pairs) and French-German (5M sentence pairs).
  • SR3de: The Semantic Role Triple Dataset for German, with parallel PropBank-, VerbNet-, and FrameNet-style semantic role annotation on approx. 3000 instances of the CoNLL 2009 shared task German data.
  • WikiCLIR: A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.

Software

  • cclir: A cross-language information retrieval (CLIR) toolbox based on the cdec decoder, code package used in Bag-of-words Forced Decoding for Cross-Lingual Information Retrieval (Hieber and Riezler, ACL 2015), inter alia.
  • dtrain: A tuning method implemented for the cdec decoder, see Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT (Simianer, Riezler and Dyer, ACL 2012).
  • otedama: Preordering for Machine Translation.
  • rebol: A toolkit for grounded learning for statistical machine translation, as described in the ACL 2014 paper, Response-Based Learning for Grounded Machine Translation (Riezler, Simianer and Haas).
  • semparse: A semantic parser that treats the task as a monolingual SMT problem. The underlying SMT framework is the cdec decoder.
