Problems with Data
Module Description
| Course | Module Abbreviation | Credit Points |
|---|---|---|
| BA-2010[100%\|75%] | CS-CL | 6 LP |
| BA-2010[50%] | BS-CL | 6 LP |
| BA-2010[25%] | BS-AC, BS-FL | 4 LP |
| BA-2010 | AS-CL | 8 LP |
| Master | SS-CL-TAC | 8 LP |
| Lecturer | Katja Markert |
| Module Type | |
| Language | English |
| First Session | 15.04.2026 |
| Time and Place | Wed. 10:15-11:45, INF 346 / SR 10; Thu. 10:15-11:45, INF 346 / SR 10 |
| Commitment Period | TBD |
Prerequisite for Participation
Participants
All advanced CL Bachelor students and all CL Master students. Students from MSc Data and Computer Science or MSc Scientific Computing with Field of Application Computational Linguistics are welcome with permission from the lecturer. If the seminar is oversubscribed, CL students will have priority.
Assessment
- Active participation, including contributing to discussions and demonstrating solutions to exercises in class. Attendance is therefore required for exercise and discussion sessions, but not for lecture sessions.
- 4-5 Exercise sheets
- Written Exam
There will be no presentation, term paper or project requirement.
Active participation and passing the exercises are prerequisites for exam participation. The mark will be determined by the exam. Students who take the module as a Hauptseminar will get a somewhat harder and/or longer exam (and exercise sheets).
Content
In this seminar we will look at various problems that arise with training and test data in NLP. Training data will include pretraining data, instruction-tuning data and RLHF data. We will look at common pitfalls, why you cannot necessarily believe state-of-the-art results, and how to stress-test both your data and your systems.
The course covers data construction and investigation methods as well as methods for identifying data problems. It will go beyond standard practices that you all know, such as training/test splits and significance tests, and will also tackle problems that are not necessarily statistical. It will also include algorithms for data packing, deduplication and annotation reconciliation.
In particular, we will include or select from the following topics:
- Data Sampling, including methods for sampling, analysis of sample sizes and power, use of opportunistic and silver data.
- Pretraining Data for LLMs: quality and toxicity filters, deduplication algorithms, data packing algorithms, methods to examine pretraining data.
- Data Annotation for supervised finetuning and instruction tuning: annotation methods (expert annotation, crowd-sourcing, LLM-as-annotator etc.), measures for inter-annotator agreement, algorithms for annotator reconciliation, impact of annotator bias, learning with annotation disagreement or noise.
- Data for reinforcement learning
- Evaluation data: data artefacts, stress tests, adversarial data, challenge datasets, temporal shifts, dynamic benchmarking, data contamination.
Examples will mainly come from the realm of large language modelling, natural language inference, reasoning and maths, summarization, as well as sentiment and hate speech.
The course is suitable for advanced Bachelor students (at least after the Orientierungspruefung, preferably from the 4th semester onwards) and all Master students.
The course will be structured as a lecture series with exercise sessions.
Schedule
| Date | Session | Materials |
|---|---|---|
Literature
To be announced in the first week of term.


