Ruprecht-Karls-Universität Heidelberg


Data sets

  • Causal annotations for German verbs, nouns and prepositions.
  • chomskyDisc, 60,000 tokens of talks, articles and interviews by/with Noam Chomsky, with 1,612 annotations of potentially causal/concessive/contrastive discourse connectives.
  • tweeDe, the German UD Twitter treebank, with >12,000 tokens from 519 tweets
  • A harmonised test suite for POS tagging of German social media data
  • CQA_de, a German corpus for Community Question Answering, with 9,300 questions
    and >50,000 answers extracted from Stackexchange and Reddit
  • The KiezDeutsch-Korpus (KiDKo)
    From 2011 to 2014 I was a postdoc in Project B2 of SFB 632 "Information Structure" in Potsdam, where we built a corpus of Kiez-German, an informal variety of German spoken by adolescents from a multi-ethnic neighborhood.
  • Modifiers in TIGER
    Together with Hagen Hirschmann, I augmented the first 10,000 sentences in the TIGER treebank with more fine-grained, syntactically motivated parts of speech for German modifiers. Our annotation scheme and parsing experiments using the new tagset are described in Rehbein and Hirschmann (KONVENS 2014, TLT 2014).
  • Modalia
    Together with my colleague Josef Ruppenhofer, I developed an annotation scheme for English modal verbs. We provide word-sense and frame role annotations for the instances of five modal verbs in the MPQA corpus.
  • The SALSA corpus
    From 2008 to 2010 I was part of the SALSA project at Saarland University, where we created a large, frame-based lexicon for German, with rich semantic and syntactic properties, as a resource for linguistic and computational linguistic research. The SALSA corpus is freely available for research purposes.

Other resources
