I moved to the Max Planck Institute for Empirical Aesthetics (MPIEA), where I am working on the computational modelling of parallelistic text features and poetic licence. This includes curating a corpus of German poetry and extracting cohesive characteristics on multiple linguistic layers.
Gutenberg Literary Genre
Gutenberg_lit is an English text genre corpus sampled from gutenberg.org. It covers three genres (Poetry, Drama and Fiction) with 500 documents each (1500 documents in total). Each genre is divided into a training set (350), a test set (50), a devA- set (50), and a devA+ set (50). The two dev sets differ in their authors: devA+ contains authors that are representative of the respective genre, while devA- contains authors that do not occur in the training set.
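The author constraint on devA- can be illustrated with a small sketch. It is not the procedure that was used to build the corpus: the function name `split_genre`, the `(doc_id, author)` input format, and the random sampling of the remaining sets are all assumptions made for illustration. The sketch only guarantees that no devA- author appears in the training set; it does not model how representative devA+ authors were chosen.

```python
import random
from collections import defaultdict


def split_genre(docs, n_dev_unseen=50, n_dev_seen=50, n_test=50, seed=0):
    """Split one genre into train / test / devA+ / devA-.

    docs: list of (doc_id, author) pairs (hypothetical input format).
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for doc_id, author in docs:
        by_author[author].append(doc_id)

    authors = sorted(by_author)
    rng.shuffle(authors)

    # devA-: move whole authors into the set until the quota is filled,
    # so that none of their documents can end up in training.
    dev_unseen, held_out = [], set()
    for author in authors:
        if len(dev_unseen) + len(by_author[author]) <= n_dev_unseen:
            dev_unseen.extend(by_author[author])
            held_out.add(author)
        if len(dev_unseen) == n_dev_unseen:
            break

    # Everything else is shuffled and cut into devA+, test and train.
    # (Assumption: the real devA+ set was selected by hand, not at random.)
    rest = [d for a in authors if a not in held_out for d in by_author[a]]
    rng.shuffle(rest)
    return {
        "devA-": dev_unseen,
        "devA+": rest[:n_dev_seen],
        "test": rest[n_dev_seen:n_dev_seen + n_test],
        "train": rest[n_dev_seen + n_test:],
    }
```

With 500 documents per genre and the default sizes, the remaining 350 documents form the training set.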
The documents were filtered with regular expressions on the subject line and the title to ensure that the classes actually contain literature and not texts about literature (such as commentaries) or something else entirely (such as inauguration speeches).
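A minimal sketch of this kind of filter is given below. The patterns, the genre keywords and the function `assign_genre` are hypothetical; the regular expressions actually used for the corpus are not listed here, so this only shows the general idea of matching a genre on the subject line while rejecting secondary literature and non-literary texts.

```python
import re

# Hypothetical exclusion pattern: texts about literature or non-literary texts.
EXCLUDE = re.compile(
    r"\b(criticism|commentar(?:y|ies)|speech(?:es)?|address(?:es)?)\b",
    re.IGNORECASE,
)

# Hypothetical genre keywords, matched against the subject line only.
GENRE = {
    "Poetry": re.compile(r"\b(poetry|poems?)\b", re.IGNORECASE),
    "Drama": re.compile(r"\b(drama|plays?|traged(?:y|ies)|comed(?:y|ies))\b", re.IGNORECASE),
    "Fiction": re.compile(r"\b(fiction|novels?)\b", re.IGNORECASE),
}


def assign_genre(subject, title):
    """Return a genre label, or None if the document should be dropped."""
    # Reject if either the subject line or the title looks non-literary.
    if EXCLUDE.search(subject) or EXCLUDE.search(title):
        return None
    # Keep only documents whose subject matches exactly one genre.
    matches = [genre for genre, pattern in GENRE.items() if pattern.search(subject)]
    return matches[0] if len(matches) == 1 else None
```

Requiring exactly one genre match is a deliberately strict choice in this sketch: ambiguous documents are dropped rather than guessed.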
All documents are preprocessed with the Stanford CoreNLP pipeline and are available as exportXML (exml). The annotation layers include:
- Unique ID, author, title (in the filename)
- Tokenization & Lemma
- Part-of-speech tags (Penn Treebank tagset)
- Named Entities (11 classes)
- Leidenfrost-Burth, L., Haider, T., Woellstein, A. (2015) Rechtschreibwortschatz für Erwachsene, Winter Verlag Heidelberg
- Bingel, J. and Haider, T. (2014) Named Entity Tagging a Very Large Unbalanced Corpus: Training and Evaluating NE Classifiers. In Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland, May. European Language Resources Association (ELRA).