Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

Problems with Data

Module Description

Course               Module Abbreviation   Credit Points
BA-2010[100%|75%]    CS-CL                 6 LP
BA-2010[50%]         BS-CL                 6 LP
BA-2010[25%]         BS-AC, BS-FL          4 LP
BA-2010              AS-CL                 8 LP
Master               SS-CL-TAC             8 LP

Lecturer             Katja Markert
Module Type          Lecture / Exercise
Language             English
First Session        15.04.2026
Time and Place       Wed. 10:15-11:45, INF 346 / SR 10
                     Thu. 10:15-11:45, INF 346 / SR 10
Commitment Period    tbd.

Prerequisite for Participation

  • For MA students: none.
  • BA students:
    • ECL
    • Programming I
    • Programming II and some experience with machine learning experiments or large language models.

Participants

All advanced CL Bachelor students and all CL master students. Students from MSc Data and Computer Science or MSc Scientific Computing with Field of Application Computational Linguistics are welcome after getting permission from the lecturer. If the seminar should be oversubscribed, CL students will have priority.

Assessment

  • Active participation, including contributing to discussions and demonstrating solutions to exercises in class. Attendance is therefore required for exercise and discussion sessions, but not for lecture sessions.
  • 4-5 Exercise sheets
  • Written Exam

There will be no presentation, term paper or project requirement.

Active participation and passing the exercises are prerequisites for exam participation. The mark will be determined by the exam. Students who take the module as a Hauptseminar will get a somewhat harder and/or longer exam (and exercise sheets).

Content

In this seminar we will look at various problems that arise with training and test data in NLP. Training data will include pretraining data, instruction-tuning data and RLHF data. We will look at common pitfalls, why you cannot necessarily believe state-of-the-art results, and how to stress test both your data and your systems.

The course covers data construction and investigation methods as well as methods for identifying data problems. It will go beyond standard practices that you already know, such as training/test splits and significance tests, or tackle problems that are not necessarily statistical. It will also include algorithms for data packing, deduplication and annotation reconciliation.

In particular, we will include or select from the following topics:

  1. Data Sampling, including methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data.
  2. Pretraining Data for LLMs: quality and toxicity filters, deduplication algorithms, data packing algorithms, methods to examine pretraining data.
  3. Data Annotation for supervised finetuning and instruction tuning: annotation methods (expert annotation, crowd-sourcing, LLM-as-annotator etc.), measures for inter-annotator agreement, algorithms for annotator reconciliation, impact of annotator bias, learning with annotation disagreement or noise.
  4. Data for reinforcement learning
  5. Evaluation data: data artefacts, stress tests, adversarial data, challenge datasets, temporal shifts, dynamic benchmarking, data contamination.

Examples will mainly come from the realm of large language modelling, natural language inference, reasoning and maths, summarization, as well as sentiment and hate speech.

The course is suitable for advanced bachelor students (at least after the Orientierungsprüfung, preferably from the 4th semester onwards) and all Master students.

The course will be structured as a lecture series with exercise sessions.

Schedule

Date                 Session               Materials

Literature

To be announced in the first week of term.
