Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

Problems with Data

Module Description

Course               Module Abbreviation   Credit Points
BA-2010[100%|75%]    CS-CL                 6 LP
BA-2010[50%]         BS-CL                 6 LP
BA-2010[25%]         BS-AC, BS-FL          4 LP
BA-2010              AS-CL                 8 LP
Master               SS-CL, SS-TAC         8 LP

Lecturer             Katja Markert
Module Type          Lecture / Exercise Class (Vorlesung / Übung)
Language             English
First Session        15.04.2024
Time and Place       Mon. 15:15-16:45, INF 326 / SR 28
                     Thu. 10:15-11:45, INF 327 / SR 3
Commitment Period    tbd

Prerequisites for Participation

  • For MA students: none.
  • BA students:
    • ECL
    • Programming I
    • Mathematische Grundlagen
    • Programming II, Experimente Gestalten mit Maschinellem Lernen, or comparable additional knowledge of algorithms, programming and machine learning is helpful but not strictly required.


The course is open to all advanced Bachelor students and all Master students. Students from Computer Science, Mathematics or Scientific Computing with Computational Linguistics as their application area (Anwendungsgebiet) are welcome.


  • Active participation, including leading discussions, contributing to discussions and demonstrating solutions to exercises in class. Attendance is therefore required for exercise and discussion sessions, but not for lecture sessions.
  • 4-5 Exercises
  • Presentation
  • Written Exam

Active participation and passing the exercises are prerequisites for taking the exam. The mark will be a weighted average of the presentation mark and the exam mark. Students who take the module as a Hauptseminar will receive a more complex presentation topic and a somewhat harder and/or longer exam (and potentially exercise sheets).


In this seminar we will look at various problems that arise with training and test data in NLP. We will examine common pitfalls, why state-of-the-art results cannot necessarily be taken at face value, and how to stress test both your data and your systems. The course covers methods for constructing and investigating data as well as methods for identifying data problems. It will go beyond standard practices that you already know, such as training/test splits and significance tests, and will also tackle problems that are not necessarily statistical.

In particular, we will include or select from the following topics:

  1. Data sampling: methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data
  2. Pretraining data for LLMs: toxicity, data contamination, deduplication, methods to examine pretraining data
  3. Data annotation for finetuning, instruction tuning and human feedback methods: annotation methods (expert annotation, crowd-sourcing, LLM annotation etc.), measures of inter-annotator agreement, impact of item order on annotation, impact of annotator bias, learning with annotation disagreement or noise, special methods for instruction tuning and RLHF
  4. Data bias and data artefacts: stress tests, adversarial data, challenge datasets, Clever Hans phenomena, counterfactual datasets
  5. Synthetic data and automatic data augmentation: are we running out of data?

Examples will mainly come from natural language inference, summarization, sentiment and hate speech, and large language modelling.

The course is suitable for advanced Bachelor students (at least after the Orientierungsprüfung, preferably from the 3rd semester onwards) and all Master students. It suits students whose primary interest is linguistic as well as those interested in ML and algorithms.

The first part of the course will be structured as a lecture with exercise classes; the second part will also include student presentations.




To be announced in the first week of term.
