Ruprecht-Karls-Universität Heidelberg

Thomas Haider

Contact details
Email: thomas.haider <at>

Current occupation

I moved to the Max Planck Institute for Empirical Aesthetics (MPIEA), where I am working on the computational modelling of parallelistic text features and poetic licence. This includes curating a corpus of German poetry and the extraction of cohesive characteristics on multiple linguistic layers.


Gutenberg Literary Genre

Gutenberg_lit is an English text genre corpus sampled from Project Gutenberg. It consists of three genres (Poetry, Drama and Fiction) with 500 documents each, 1500 documents in total. Each genre is divided into a training set (350), a test set (50), a devA- set (50), and a devA+ set (50). The two dev sets differ in their authors: devA+ contains authors that are representative of the respective genre, while devA- contains only authors that are not present in the training set.
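As a rough illustration of the author-disjoint devA- idea, here is a minimal Python sketch. It is not the procedure I used to build the corpus: the function name `split_genre`, the `(doc_id, author)` input format and the greedy hold-out strategy are all assumptions made for the example. It only enforces the devA- constraint (no author overlap with training) and fills the remaining sets by random sampling.

```python
import random
from collections import defaultdict


def split_genre(docs, n_dev_minus=50, n_dev_plus=50, n_test=50, seed=0):
    """Split the documents of ONE genre into train/test/devA+/devA-.

    docs: list of (doc_id, author) pairs (hypothetical input format).
    Only the devA- constraint is enforced: its authors never occur in
    any other set. Everything else is sampled at random.
    """
    rng = random.Random(seed)

    # Group document ids by author so whole authors can be held out.
    by_author = defaultdict(list)
    for doc_id, author in docs:
        by_author[author].append(doc_id)

    authors = sorted(by_author)
    rng.shuffle(authors)

    # Greedily hold out complete authors for devA-. Depending on how many
    # documents each author has, the quota may not be met exactly.
    held_out = set()
    n_held = 0
    for author in authors:
        size = len(by_author[author])
        if n_held + size <= n_dev_minus:
            held_out.add(author)
            n_held += size
        if n_held == n_dev_minus:
            break

    dev_minus = [(d, a) for d, a in docs if a in held_out]
    rest = [(d, a) for d, a in docs if a not in held_out]
    rng.shuffle(rest)

    return {
        "devA_minus": dev_minus,
        "devA_plus": rest[:n_dev_plus],
        "test": rest[n_dev_plus:n_dev_plus + n_test],
        "train": rest[n_dev_plus + n_test:],
    }
```

With 500 documents per genre and the default sizes, this leaves 350 documents for training whenever the devA- quota can be filled exactly.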

The documents were filtered with regular expressions on the subject line and the title to ensure that the classes actually contain literature and not texts about literature (such as commentaries) or something else entirely (inauguration speeches). All documents are preprocessed with the Stanford CoreNLP pipeline and are available as exportXML (exml). The annotation layers include:

  • Unique Id, Author, Title (In the filename)
  • Tokenization & Lemma
  • Sentences
  • Part-of-speech tags (Penn Treebank tagset)
  • Named Entities (11 classes)

The corpus spans the late 18th century to the early 20th century. I can make it available to you upon request at thomas.haider .a. . For details and experiments on this dataset, please see my MA thesis.
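The filtering on subject line and title can be pictured with the following Python sketch. The two patterns and the function name `keep_document` are hypothetical and only illustrate the idea; they are not the expressions actually used for the corpus.

```python
import re

# Hypothetical pattern for metadata that suggests a literary genre.
GENRE_PATTERN = re.compile(
    r"\b(poetry|poems|drama|plays?|fiction|novels?)\b", re.IGNORECASE
)

# Hypothetical pattern for texts ABOUT literature or unrelated material.
EXCLUDE_PATTERN = re.compile(
    r"\b(commentar(y|ies)|criticism|speech(es)?|inaugural|addresses)\b",
    re.IGNORECASE,
)


def keep_document(subject, title):
    """Return True if the metadata looks like literature itself."""
    text = f"{subject} {title}"
    has_genre = GENRE_PATTERN.search(text) is not None
    is_excluded = EXCLUDE_PATTERN.search(text) is not None
    return has_genre and not is_excluded
```

A document is kept only if its subject or title names a genre and nothing in either field points to commentary or non-literary material.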



  • Leidenfrost-Burth, L., Haider, T., Woellstein, A. (2015) Rechtschreibwortschatz für Erwachsene, Winter Verlag Heidelberg


  • Bingel, J. and Haider, T. (2014) Named Entity Tagging a Very Large Unbalanced Corpus: Training and Evaluating NE Classifiers. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland, May. European Language Resources Association (ELRA).
You can find the serialized model for the Stanford NER system here, alongside the models of Faruqui and Padó (2010), which we outperform by a wide margin on our test set.

