- Causal annotations for German verbs, nouns and prepositions.
- chomskyDisc, 60,000 tokens of talks, articles and interviews by/with Noam Chomsky, with 1,612 annotations of potentially causal/concessive/contrastive discourse connectives.
- tweeDe, the German UD Twitter treebank, with >12,000 tokens from 519 tweets
- A harmonised test suite for POS tagging of German social media data
- CQA_de, a German corpus for Community Question Answering, with 9,300 questions and >50,000 answers extracted from Stackexchange and Reddit
- The KiezDeutsch-Korpus (KiDKo)
From 2011 to 2014 I was a postdoc in Project B2 of SFB 632 "Information Structure" in Potsdam, where we built a corpus of Kiez-German, an informal variety of German spoken by adolescents from a multi-ethnic neighborhood.
- Modifiers in TIGER
Together with Hagen Hirschmann, I augmented the first 10,000 sentences in the TIGER treebank with more fine-grained, syntactically motivated part-of-speech tags for German modifiers. Our annotation scheme and parsing experiments using the new tagset are described in Rehbein and Hirschmann (KONVENS 2014, TLT 2014).
Together with my colleague Josef Ruppenhofer, I developed an annotation scheme for English modal verbs. We provide word-sense and frame-role annotations for the instances of 5 modal verbs in the MPQA corpus.
- The SALSA corpus
From 2008 to 2010 I was part of the SALSA project at Saarland University, where we created a large, frame-based lexicon for German, with rich semantic and syntactic properties, as a resource for linguistic and computational linguistic research. The SALSA corpus is freely available for research purposes.
- Python scripts that convert function-head style encodings in dependency treebanks into a content-head style encoding (as used in the UD treebanks) and vice versa (for adpositions, copula and coordination). For more information, see our DepLing paper (Rehbein, Steen, Do & Frank 2017).
- Together with Josef Ruppenhofer and Julius Steen, I developed a method for detecting noise in automatically annotated sequence-labelled data, combining MACE (Hovy et al. 2014) with Active Learning.
The source code and a simple annotation interface are available for download. Many thanks to Julius Steen for restructuring the code and adding the GUI!
- We extended our method for error detection in treebanks, as described in our COLING 2018 paper "Sprucing up the trees -- Error detection in treebanks".
Here you can download the source code for MACE-AL-TREE.
With our students, Marcel and Jonas, we developed MaJo, a toolkit for Word Sense Disambiguation and Active Learning.
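The difference between function-head and content-head encodings mentioned for the conversion scripts above can be illustrated with a minimal sketch for adpositions. This is not the code from the DepLing paper. The token format (dicts with `id`, `form`, `upos`, `head`, `deprel`), the function name `adposition_to_content_head`, the `case` label, and the rule "the first dependent to the right of the adposition is its complement" are all assumptions made for illustration only.

```python
def adposition_to_content_head(tokens, adp_tag="ADP", case_label="case"):
    """Re-attach adpositions below their complement (content-head style).

    tokens: list of dicts with keys id, form, upos, head, deprel.
    Returns a new list; the input is left unchanged.
    Simplifying assumption: adpositions are not nested inside each other.
    """
    out = [dict(t) for t in tokens]  # work on shallow copies
    for adp in out:
        if adp["upos"] != adp_tag:
            continue
        # All tokens currently attached to the adposition.
        deps = [t for t in out if t["head"] == adp["id"]]
        # Assumed heuristic: the complement is the first dependent
        # that follows the adposition in the sentence.
        right = [t for t in deps if t["id"] > adp["id"]]
        if not right:
            continue  # nothing to promote, leave the token as it is
        comp = min(right, key=lambda t: t["id"])
        # The complement takes over the adposition's attachment ...
        comp["head"], comp["deprel"] = adp["head"], adp["deprel"]
        # ... remaining dependents move to the complement ...
        for t in deps:
            if t is not comp:
                t["head"] = comp["id"]
        # ... and the adposition becomes a dependent of the complement.
        adp["head"], adp["deprel"] = comp["id"], case_label
    return out


# "wohnt in Berlin" in a function-head encoding (labels are illustrative).
sentence = [
    {"id": 1, "form": "wohnt", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "form": "in", "upos": "ADP", "head": 1, "deprel": "mo"},
    {"id": 3, "form": "Berlin", "upos": "PROPN", "head": 2, "deprel": "nk"},
]
converted = adposition_to_content_head(sentence)
for tok in converted:
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

In this sketch the promoted complement simply inherits the label the adposition had before. A real conversion would also need a label mapping between the two annotation schemes, plus separate rules for copula and coordination.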