Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Reproducible Machine Learning

Module Description

Course                           Module Abbreviation   Credit Points
Bachelor CL                      AS-CL                 8 LP
Master CL                        SS-CL, SS-TAC         8 LP
Seminar Informatik BA + MA                             4 LP
Anwendungsgebiet Informatik MA                         8 LP
Anwendungsgebiet SciComp MA                            8 LP
Lecturer Stefan Riezler
Module Type Seminar
Language English
First Session 16.04.2024
Time and Place Tuesday, 14:15 - 15:45
Mathematikon SR11
Commitment Period tbd.


Target Audience

Advanced Bachelor students and all Master students. Students from Computer Science or Scientific Computing are welcome, especially those with the application area Computational Linguistics.

Prerequisite for Participation

Good knowledge of statistical machine learning and experience in experimental work.


Assessment

  • Regular and active participation (discussion of the presented papers during seminar sessions)
  • Oral presentation (30-minute presentation + 15-minute discussion; commit to a presentation by April 23, 2024, by email, stating 3 ranked preferences)
  • Implementation project with written report (required for 8 LP) or written term paper (required for 4 LP); 5 pages, accompanied by a signed declaration of independent authorship; deadline: end of semester


Reproducibility of experimental results is one of the fundamental pillars of scientific research. If neither a reliable nor a significant evaluation result can be obtained when replicating an experiment, the methodological foundation of the research becomes questionable, and the validity of its results is cast into doubt.

In this seminar we will learn about several sources of nondeterminism that hamper reproducibility, and about statistical reliability and significance tests that allow us to analyze the inferential reproducibility of machine learning research. This means that instead of removing all sources of measurement noise, we will incorporate certain types of variance as irreducible conditions of measurement, and analyze their interaction with data properties, with the aim of drawing inferences beyond particular instances of trained models.
We will show how to incorporate meta-parameter variations and data properties into statistical significance testing with Generalized Likelihood Ratio Tests (GLRTs), how to use variance component analysis based on Linear Mixed Effects Models (LMEMs) to analyze the contribution of noise sources to overall variance, and how to compute a reliability coefficient as an indicator of reproducibility.
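The techniques above can be sketched on simulated data: a likelihood ratio test comparing two systems while modeling training-run noise with an LMEM random intercept, variance components extracted from the fitted LMEM, and an ICC-style reliability coefficient. This is a minimal illustration using statsmodels, not the seminar's actual toolkit; all names, the data-generating process, and the simple one-degree-of-freedom test are assumptions for the sake of the example.

```python
# Minimal sketch: LMEM-based significance testing, variance components,
# and an ICC-style reliability coefficient on simulated evaluation scores.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_seeds, n_items = 8, 150

# Simulate per-item scores for two systems: each training run (seed)
# gets a random offset (run-to-run noise) on top of the system mean,
# plus item-level residual noise.
rows = []
for system, mean in [("baseline", 0.72), ("new", 0.74)]:
    for seed in range(n_seeds):
        run_offset = rng.normal(0.0, 0.02)
        item_scores = mean + run_offset + rng.normal(0.0, 0.05, n_items)
        rows.extend(
            {"score": s, "system": system, "run": f"{system}-{seed}"}
            for s in item_scores
        )
df = pd.DataFrame(rows)

# Likelihood ratio test: nested LMEMs with and without the system effect,
# each with a random intercept per training run. Fit by ML (reml=False),
# since the two models differ in their fixed-effects structure.
m1 = smf.mixedlm("score ~ system", df, groups=df["run"]).fit(reml=False)
m0 = smf.mixedlm("score ~ 1", df, groups=df["run"]).fit(reml=False)
lr = 2.0 * (m1.llf - m0.llf)
p_value = chi2.sf(lr, df=1)  # one extra fixed-effect parameter

# Variance components from the alternative model: between-run variance
# (training noise) vs. residual item-level variance.
var_between = float(m1.cov_re.iloc[0, 0])
var_within = float(m1.scale)

# ICC-style coefficient: share of total variance attributable to the
# grouping factor (here, the training run).
icc = var_between / (var_between + var_within)
print(f"LRT={lr:.2f} p={p_value:.4f} ICC={icc:.3f}")
```

The design choice worth noting is the grouping structure: by declaring the training run as a random effect, run-to-run noise becomes part of the model rather than something to be averaged away, which is exactly the inferential-reproducibility stance described above.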


Date Material Presenter
16.4. Orga Riezler
23.4. Introduction
Hagmann and Riezler, 2023. Towards Inferential Reproducibility of Machine Learning Research.
7.5. Sources of Nondeterminism: Implementation-Level
[1] Pham et al., 2021. Problems and opportunities in training deep learning software systems: An analysis of variance.
Further reading:
[2] Zhuang et al., 2022. Randomness in neural network training: Characterizing the impact of tooling.
[1] Asma Motmem
14.5. Sources of Nondeterminism: Optimizer-Level
[4] Schmidt et al., 2021. Descending through a crowded valley - benchmarking deep learning optimizers.
Further reading:
[5] Ahn et al., 2022. Reproducibility in optimization: Theoretical framework and limits.
[4] Yu-Chuan Cheng
21.5. Sources of Nondeterminism: Metaparameter Variation
[6] Melis et al., 2018. On the state of the art of evaluation in neural language models.
Further reading:
[7] Reimers and Gurevych, 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging.
[6] Lisa Jockwitz
[7] Sophia Wikinger
28.5. Sources of Nondeterminism: Evaluation Metrics
[8] Chen et al., 2022. Reproducibility issues for BERT-based evaluation metrics.
Further reading:
[9] Post, 2018. A call for clarity in reporting BLEU scores.
[8] Siddhant Tripathi
[9] Bingyu Guo
4.6. Sources of Nondeterminism: Data Splits
[10] Søgaard et al., 2021. We need to talk about random splits.
Further reading:
[11] Gorman and Bedrick, 2019. We need to talk about standard splits.
[10] Lydia Körber
[11] Xinyue Cheng
11.6. Reliability Measures: Bootstrap Confidence Intervals
[12] Agarwal et al., 2021. Deep reinforcement learning at the edge of the statistical precipice.
Further reading:
[13] Henderson et al., 2018. Deep reinforcement learning that matters.
[12] Marlon Dittes
[13] Hammad Aamer
18.6. Reliability Measures: Variance Component Analysis and Intra-Class Correlation Coefficient
[14] Chapter 3 of Riezler and Hagmann, 2022. Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science.
Further reading:
[15] Ferro and Silvello, 2016. A general linear mixed models approach to study system component effects.
[14] Dana Simedrea
[15] Marko Lukosek
25.6. Significance Testing: Abandon p-values?
[16] McShane et al., 2019. Abandon statistical significance.
Further reading:
[17] Colquhoun, 2017. The reproducibility of research and the misinterpretation of p-values.
[16] Muskan Hashim
2.7. Significance Testing: Score Distribution Comparison
[18] Dror et al., 2019. Deep dominance - how to properly compare deep neural models.
Further reading:
[19] Ulmer et al., 2022. deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks.
[18] Yanxin Jia
9.7. Significance Testing: Bootstrap and Randomization
[20] Clark et al., 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability.
Further reading:
[21] Sellam et al., 2022. The multiBERTs: BERT reproductions for robustness analysis.
[20] Paul Stefan Saegert
16.7. Significance Testing: The Generalized Likelihood Ratio Test
[22] Chapter 4 of Riezler and Hagmann, 2022. Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science.
Further reading:
[23] Robertson and Kanoulas, 2012. On per-topic variance in IR evaluation.
[22] David Schwenke
23.7. Implementation Project Discussion
Inferential Reproducibility Toolkit