Ruprecht-Karls-Universität Heidelberg

A harmonised testsuite for POS Tagging
of German social media data

Data sets

A harmonised POS testsuite of web data, CMC and Twitter microtext

  • download the harmonised dataset (56,572 tokens)
  • with word forms and STTS POS tags (plus some additional CMC-specific tags)
  • UD POS tags were derived automatically from the STTS POS tags
  • no gold lemma information (for parts of the data, automatically predicted lemmas are included)
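
The automatic STTS-to-UD conversion mentioned above can be pictured as a table lookup. The following is only a minimal sketch: the mapping is a small illustrative subset, not the conversion table actually used for this dataset, and the names `STTS_TO_UPOS`, `stts_to_upos` and `convert_sentence` are hypothetical. Mapping unknown tags (for example CMC-specific ones) to `X` is likewise an assumption.

```python
# Illustrative subset of an STTS -> UD (UPOS) mapping.
# ASSUMPTION: this is not the table used to build the dataset.
STTS_TO_UPOS = {
    "NN": "NOUN",
    "NE": "PROPN",
    "ADJA": "ADJ",
    "ADJD": "ADJ",
    "ART": "DET",
    "APPR": "ADP",
    "VVFIN": "VERB",
    "VAFIN": "AUX",
    "PPER": "PRON",
    "ADV": "ADV",
    "KON": "CCONJ",
    "KOUS": "SCONJ",
    "CARD": "NUM",
    "ITJ": "INTJ",
    "$.": "PUNCT",
    "$,": "PUNCT",
    "$(": "PUNCT",
}


def stts_to_upos(stts_tag, fallback="X"):
    """Look up the UPOS tag for an STTS tag; unknown tags get the fallback."""
    return STTS_TO_UPOS.get(stts_tag, fallback)


def convert_sentence(tagged_tokens):
    """Turn (form, STTS) pairs into (form, STTS, UPOS) triples."""
    return [(form, stts, stts_to_upos(stts)) for form, stts in tagged_tokens]


example = [("Das", "ART"), ("Wetter", "NN"), ("ist", "VAFIN"),
           ("schön", "ADJD"), (".", "$.")]
for form, stts, upos in convert_sentence(example):
    print(f"{form}\t{stts}\t{upos}")
```

A pure lookup like this cannot resolve cases where one STTS tag corresponds to several UD tags depending on context, so a real conversion may need additional rules.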

The original data comes from 3 different sources:

If you only want to download the new Twitter testsuite:

tweeDe (POS)

  • the tweeDe dataset, split into train/dev/test sets
  • 56,572 tokens

POS tagging models for German social media

Results

text type HunPos Online-Flors biLSTM-char-CRF
TiGer 96.48
web 93.73
cmc 89.49
twitter (GSCL) 91.23
tweeDe 93.10
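
The page does not state which metric the scores above report. Assuming they are token-level tagging accuracy in percent (the usual measure for POS tagging), a comparable number could be computed for any tagger's output as in the sketch below. The function name `tagging_accuracy` is hypothetical.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the percentage of tokens whose predicted tag equals the gold tag.

    ASSUMPTION: both sequences are aligned token by token.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


gold = ["ART", "NN", "VAFIN", "ADJD"]
pred = ["ART", "NE", "VAFIN", "ADJD"]
print(f"{tagging_accuracy(gold, pred):.2f}")
```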
