Ruprecht-Karls-Universität Heidelberg

A harmonised testsuite for POS Tagging
of German social media data

Data sets

A harmonised POS testsuite of web data, CMC and Twitter microtext

  • download the harmonised dataset (56,572 tokens)
  • word forms with STTS POS tags (plus some additional CMC-specific tags)
  • UD POS tags, derived automatically from the STTS POS tags
  • no lemma information, except for parts of the data, which carry automatically predicted lemmas
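
A tag conversion of this kind can be sketched as a simple lookup table. The mapping below is an illustrative, partial assumption and not the conversion table used for this dataset; the names STTS_TO_UD, stts_to_ud and convert are hypothetical, and unknown tags (including the CMC-specific ones) fall back to X.

```python
# Illustrative, partial STTS -> UD mapping.
# Assumption: this is NOT the authors' conversion table, only a sketch.
STTS_TO_UD = {
    "NN": "NOUN", "NE": "PROPN",
    "ADJA": "ADJ", "ADJD": "ADJ",
    "ART": "DET", "APPR": "ADP", "ADV": "ADV",
    "CARD": "NUM", "KON": "CCONJ", "KOUS": "SCONJ",
    "ITJ": "INTJ", "PPER": "PRON", "PTKNEG": "PART",
    "VVFIN": "VERB", "VVINF": "VERB", "VVPP": "VERB",
    "VAFIN": "AUX", "VMFIN": "AUX",
    "$.": "PUNCT", "$,": "PUNCT", "$(": "PUNCT",
    "XY": "X",
}


def stts_to_ud(tag, default="X"):
    """Return the UD tag for an STTS tag, or `default` if it is not in the table."""
    return STTS_TO_UD.get(tag, default)


def convert(tagged_tokens):
    """Convert a list of (word, STTS tag) pairs into (word, UD tag) pairs."""
    return [(word, stts_to_ud(tag)) for word, tag in tagged_tokens]


example = [("Der", "ART"), ("Tweet", "NN"), ("ist", "VAFIN"),
           ("kurz", "ADJD"), (".", "$.")]
print(convert(example))
```

A real conversion would need to cover the full STTS tagset and decide explicitly how each CMC-specific tag is mapped, rather than relying on the fallback.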

The original data comes from 3 different sources:

If you only want to download the new Twitter testsuite:

tweeDe (POS)

  • the tweeDe dataset, split into train/dev/test sets
  • 56,572 tokens

POS tagging models for German social media

HunPos model

  • pre-trained model for the HunPos tagger (Halácsy et al. 2007)
    (including a readme)

Bi-LSTM-char-crf model

  • pre-trained model for the biLSTM-CRF tagger (Reimers & Gurevych 2017)
    (including a readme)

Twitter SkipGram Embeddings

  • download from here

Results

text type        HunPos   biLSTM-char-CRF
TiGer            96.48    97.56
web              93.73    93.93
cmc              89.49    91.44
twitter (GSCL)   91.23    92.20
tweeDe           93.10    92.31
