A harmonised testsuite for POS tagging of German social media data
Data sets
A harmonised POS testsuite of web data, CMC and Twitter microtext
- download the harmonised dataset (56,572 tokens)
- with word forms and STTS POS tags (plus some additional CMC-specific tags)
- with UD POS tags, converted automatically from the STTS POS tags
- no lemma information (or, for parts of the data, automatically predicted lemmas)
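The tag conversion mentioned in the list above can be pictured as a table lookup. The following is a minimal sketch only: the mapping is a small illustrative subset chosen for this example, not the conversion table used for this dataset, and the name `stts_to_ud` and the fallback tag `X` are assumptions.

```python
# Illustrative subset of an STTS -> UD coarse tag mapping.
# Assumption: this is NOT the conversion table used for the dataset.
STTS_TO_UD = {
    "NN": "NOUN",
    "NE": "PROPN",
    "ADJA": "ADJ",
    "ART": "DET",
    "VVFIN": "VERB",
    "VAFIN": "AUX",
    "APPR": "ADP",
    "ADV": "ADV",
    "PPER": "PRON",
    "CARD": "NUM",
    "ITJ": "INTJ",
    "$.": "PUNCT",
    "$,": "PUNCT",
    "$(": "PUNCT",
}


def stts_to_ud(tagged_tokens, fallback="X"):
    """Map (word form, STTS tag) pairs to (word form, UD tag) pairs.

    Tags missing from the table (for example CMC-specific tags) are
    mapped to `fallback` instead of raising an error.
    """
    return [(form, STTS_TO_UD.get(tag, fallback)) for form, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("Das", "ART"), ("Haus", "NN"), (".", "$.")]
    print(stts_to_ud(sentence))
```

A real conversion needs a decision for every STTS tag and for the additional CMC-specific tags, which this sketch simply sends to the fallback.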
The original data comes from three different sources:
- a Twitter dataset with 21,181 tokens (Rehbein 2013)
- two datasets from the Empirist shared task 2015
  - web data (12,718 tokens)
  - computer-mediated communication (10,505 tokens)
- tweeDe: a Twitter dataset with 12,156 tokens (Rehbein, Ruppenhofer & Zimmermann 2018)
If you only want to download the new Twitter testsuite:
tweeDe (POS)
- the tweeDe dataset, split up into train/dev/test set
- 56,572 tokens
POS tagging models for German social media
- pre-trained model for the HunPos tagger (Halácsy et al. 2007), including a readme
- pre-trained model for the biLSTM-CRF tagger (Reimers & Gurevych 2017), including a readme
- download from here
HunPos model
Bi-LSTM-char-crf model
Twitter SkipGram Embeddings
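
To try the HunPos model on your own text, the input has to be tokenised first. The sketch below assumes the usual HunPos conventions (one token per line, an empty line after each sentence, and output lines of the form token, tab, tag); check the readme shipped with the model before relying on this. The helper names `to_hunpos_input` and `parse_hunpos_output` are hypothetical.

```python
def to_hunpos_input(sentences):
    """Render tokenised sentences as one token per line.

    Each sentence is followed by an empty line (assumed HunPos input format).
    """
    lines = []
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # empty line marks the end of a sentence
    return "\n".join(lines) + "\n"


def parse_hunpos_output(text):
    """Parse tab-separated token/tag lines back into tagged sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.partition("\t")
        current.append((token, tag.strip()))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    print(to_hunpos_input([["Hallo", "Welt"], ["ok"]]), end="")
```

The string returned by `to_hunpos_input` is meant to be written to the standard input of the `hunpos-tag` binary, which takes the model file as its argument; the tagger's output can then be passed to `parse_hunpos_output`.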
Results
text type | HunPos | biLSTM-char-CRF |
---|---|---|
TiGer | 96.48 | 97.56 |
web | 93.73 | 93.93 |
cmc | 89.49 | 91.44 |
twitter (GSCL) | 91.23 | 92.20 |
tweeDe | 93.10 | 92.31 |
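
The table does not name its metric. Assuming the figures are token-level tagging accuracy in percent, a score of this kind could be computed as in the following minimal sketch; the function name `tagging_accuracy` is hypothetical, and this is not the evaluation script used to produce the table.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the percentage of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


if __name__ == "__main__":
    gold = ["ART", "NN", "VVFIN", "NN"]
    pred = ["ART", "NN", "VVFIN", "NE"]
    print(f"{tagging_accuracy(gold, pred):.2f}")
```

Both sequences must cover the same tokens in the same order, so the tagger output has to use the tokenisation of the gold data.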