
Resources
Data sets
- Causal annotations for German verbs, nouns and prepositions.
- chomskyDisc, 60,000 tokens of talks, articles and interviews by/with Noam Chomsky, with 1,612 annotations of potentially causal/concessive/contrastive discourse connectives.
- tweeDe, the German UD Twitter treebank, with >12,000 tokens from 519 tweets
- A harmonised test suite for POS tagging of German social media data
- CQA_de, a German corpus for Community Question Answering, with 9,300 questions and >50,000 answers extracted from Stackexchange and Reddit
- The KiezDeutsch-Korpus (KiDKo)
From 2011 to 2014 I was a postdoc in Project B2 of SFB 632 "Information Structure" in Potsdam, where we built a corpus of Kiez-German, an informal variety of German spoken by adolescents from a multi-ethnic neighborhood.
- Modifiers in TIGER
Together with Hagen Hirschmann, I augmented the first 10,000 sentences of the TIGER treebank with more fine-grained, syntactically motivated part-of-speech tags for German modifiers. Our annotation scheme and parsing experiments using the new tagset are described in Rehbein and Hirschmann (KONVENS 2014, TLT 2014).
- Modalia
Together with my colleague Josef Ruppenhofer, I developed an annotation scheme for English modal verbs. We provide word-sense and frame role annotations for the instances of 5 modal verbs in the MPQA corpus.
- The SALSA corpus
From 2008 to 2010 I was part of the SALSA project at Saarland University, where we created a large, frame-based lexicon for German, with rich semantic and syntactic properties, as a resource for linguistic and computational linguistic research. The SALSA corpus is freely available for research purposes.
Other resources
- Python scripts that convert function-head style encodings in dependency treebanks into a content-head style encoding (as used in the UD treebanks) and vice versa (for adpositions, copula and coordination). For more information, see our DepLing paper (Rehbein, Steen, Do & Frank 2017).
- Together with Josef Ruppenhofer and Julius Steen, I developed a method for detecting noise in automatically annotated sequence-labelled data, combining MACE (Hovy et al. 2014) with Active Learning.
The source code and a simple annotation interface are available for download. Many thanks to Julius Steen for restructuring the code and adding the GUI!
- We extended our method to error detection in treebanks, as described in our COLING 2018 paper "Sprucing up the trees -- Error detection in treebanks".
Here you can download the source code for MACE-AL-TREE.
- MaJo
With our students, Marcel and Jonas, we developed MaJo, a toolkit for Word Sense Disambiguation and Active Learning.
