Ruprecht-Karls-Universität Heidelberg

Institut für Computerlinguistik

A harmonised testsuite for POS Tagging
of German social media data

Data sets

A harmonised POS testsuite of web data, CMC and Twitter microtext

download the harmonised dataset (56,572 tokens)
with word forms and STTS POS tags (plus some additional CMC-specific tags)
UD POS tags were converted automatically from the STTS POS tags
no lemma information, except for automatically predicted lemmas in parts of the data
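
The automatic STTS-to-UD conversion mentioned above can be pictured as a table lookup. The following is a minimal sketch, assuming a purely tag-based mapping; the table is a small illustrative subset written for this example and is not the conversion table used to build the dataset. The names `STTS_TO_UD`, `stts_to_ud` and `convert_sentence` are hypothetical, and the fallback to `X` for unlisted tags (including the CMC-specific ones) is likewise an assumption.

```python
# Illustrative subset of an STTS -> UD coarse tag mapping.
# Assumption: this is NOT the table used for the harmonised dataset.
STTS_TO_UD = {
    "NN": "NOUN",
    "NE": "PROPN",
    "ADJA": "ADJ",
    "ADJD": "ADJ",
    "ART": "DET",
    "APPR": "ADP",
    "ADV": "ADV",
    "CARD": "NUM",
    "KON": "CCONJ",
    "KOUS": "SCONJ",
    "PPER": "PRON",
    "VVFIN": "VERB",
    "ITJ": "INTJ",
    "PTKNEG": "PART",
    "$.": "PUNCT",
    "$,": "PUNCT",
    "$(": "PUNCT",
}


def stts_to_ud(tag: str) -> str:
    """Return the UD tag for an STTS tag, or "X" if the tag is not listed."""
    return STTS_TO_UD.get(tag, "X")


def convert_sentence(tagged):
    """Turn (word, STTS tag) pairs into (word, STTS tag, UD tag) triples."""
    return [(word, tag, stts_to_ud(tag)) for word, tag in tagged]


if __name__ == "__main__":
    sentence = [("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN"), (".", "$.")]
    for word, stts, ud in convert_sentence(sentence):
        print(f"{word}\t{stts}\t{ud}")
```

A lookup like this cannot resolve distinctions that depend on context (for example, whether a form of "haben" is used as an auxiliary or a main verb), so a real conversion would likely need extra rules for such cases.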

The original data come from three different sources:

a Twitter dataset with 21,181 tokens (Rehbein 2013)
two datasets from the EmpiriST shared task 2015
- web data (12,718 tokens)
- computer-mediated communication (10,505 tokens)
tweeDe: a Twitter dataset with 12,156 tokens (Rehbein, Ruppenhofer & Zimmermann 2018)

If you only want to download the new Twitter testsuite:

tweeDe (POS)

the tweeDe dataset, split into train/dev/test sets
12,156 tokens

POS tagging models for German social media

HunPos model

pre-trained model for the HunPos tagger (Halácsy et al. 2007)
(including a readme)

biLSTM-char-CRF model

pre-trained model for the biLSTM-CRF tagger (Reimers & Gurevych 2017)
(including a readme)

Twitter SkipGram Embeddings

download from here

Results

text type	HunPos	biLSTM-char-CRF
TiGer	96.48	97.56
web	93.73	93.93
cmc	89.49	91.44
twitter (GSCL)	91.23	92.20
tweeDe	93.10	92.31
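
The page does not state the evaluation metric. Assuming the figures in the table are token-level tagging accuracy in percent, a score of this kind could be computed as in the sketch below. The function name `tagging_accuracy` and the flat list-of-tags input are hypothetical choices made for this example.

```python
def tagging_accuracy(gold, predicted):
    """Token-level accuracy in percent for two aligned tag sequences.

    Assumption: both sequences cover the same tokens in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # Count positions where the predicted tag matches the gold tag.
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    gold = ["ART", "NN", "VVFIN", "$."]
    predicted = ["ART", "NE", "VVFIN", "$."]
    print(tagging_accuracy(gold, predicted))
```

With the four-token example in the block, one of four tags is wrong, so the function returns 75.0.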