Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Problems with Data


Degree programme       Module code     Credit points
BA-2010 [100% | 75%]   CS-CL           6 LP
BA-2010 [50%]          BS-CL           6 LP
BA-2010 [25%]          BS-AC, BS-FL    4 LP
BA-2010                AS-CL           8 LP
Master                 SS-CL, SS-TAC   8 LP

Lecturer               Katja Markert
Course type            Lecture / exercise class
Language               English
First session          17.04.2023
Time and place         Monday, 15:15-16:45, INF 325 / SR 24; Thursday, 10:15-11:45, INF 329 / SR 26
Commitment deadline    tbd.


Prerequisites
  • For MA students: none.
  • For BA students:
    • ECL
    • Programming I
    • Programming II, Experimente Gestalten mit Maschinellem Lernen, or similar additional knowledge of algorithms, programming and machine learning is helpful but not strictly required.


Target audience
All advanced Bachelor students and all Master students. Students from Computer Science, Mathematics or Scientific Computing with Computational Linguistics as their application area (Anwendungsgebiet) are also welcome.


Assessment
  • Active participation, including leading discussions, contributing to discussions, and presenting solutions to exercises in class; attendance is therefore required for these sessions.
  • Exercises (4-5 exercise sheets or programming exercises, to be worked on in class or in pairs at home).
  • Presentation
  • Written exam

Active participation and passing the exercises are prerequisites for taking the exam. The final mark will be a weighted average of the presentation mark and the exam mark. Students who take the module as a Hauptseminar will receive a more complex presentation topic and a somewhat harder and/or longer exam (and potentially harder exercise sheets).


In this seminar we will look at various problems that arise with training and test data in NLP: common pitfalls, why you cannot necessarily believe state-of-the-art results, and how to stress-test both your data and your systems. The course covers data construction and investigation methods as well as machine learning methods for identifying data problems and for learning with data noise. It goes beyond standard practices that you all know, such as training/test splits and significance tests, and also tackles problems which are not necessarily statistical.
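As a small illustration of one such standard practice, two systems can be compared on per-example scores with a paired approximate randomization (permutation) test. This is a minimal sketch, not course material; the function name and inputs (lists of per-example scores, e.g. per-sentence accuracies) are our own assumptions:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on the mean per-example
    score difference between two systems (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    count = 0
    for _ in range(n_resamples):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the pair's system labels
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            count += 1
    # Smoothed p-value: fraction of resampled differences at least as extreme
    return (count + 1) / (n_resamples + 1)
```

If the two systems' scores are identical, the test returns a p-value of 1; a consistent per-example advantage yields a small p-value.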

In particular, we will include or select from the following topics:

  1. Data Sampling, including methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data.
  2. Data annotation: annotation methods (expert annotation, crowd-sourcing, etc.), measures for inter-annotator agreement, the impact of item order on annotation, the impact of annotator bias (does it matter whether your hate speech data for race is annotated by minority members or not?), and how to learn from data where annotation is inherently subjective (learning with annotation disagreement or noise)
  3. Training noise and data usability: learning with noise, automatically identifying noise and the most useful training examples
  4. Data bias and data artefacts: stress tests, adversarial data, challenge datasets, Clever Hans phenomena, counterfactual datasets
  5. Too little data: automatic data augmentation in data space and feature space, and potentially some semi-supervised learning if time allows (no unsupervised learning, as this is covered in Learning without annotated data)
  6. Model robustness: how well does your model deal with noise in testing?

Examples will mainly come from natural language inference, summarization, sentiment analysis and hate speech detection.

The course is suitable for advanced Bachelor students (at least after the Orientierungsprüfung) and all Master students. It suits both primarily linguistic interests and ML/algorithmic interests.

The first part of the course will be structured as a lecture with exercise classes; the second part will also include student presentations.




Literature
To be announced in the first week of term.

