Tutorial: Reproducible Machine Learning

Overview

Scientific progress in machine learning is driven by empirical studies that evaluate the relative quality of models. The goal of such an evaluation is to compare machine learning methods themselves, not to reproduce single test-set evaluations of particular optimized instances of trained models. Reporting the performance score of a single best model is particularly inadequate for deep learning, because the performance of deep models depends strongly on various sources of randomness. This practice raises three methodological questions: whether a model predicts what it purports to predict (validity), whether a model’s performance is consistent across replications of the training process (reliability), and whether a performance difference between two models is due to chance (significance). This tutorial answers these questions with concrete statistical tests. It is hands-on and accompanied by a textbook (Riezler and Hagmann, 2024) and a webpage including R and Python code.

Introduction (slides)
Mathematical Background: Linear Mixed Effects Models (LMEMs) and Generalized Likelihood Ratio Test (GLRT) (slides)
Significance (slides)
Reliability (slides)
Recap: A worked-through example (slides)
Mathematical Background: Generalized Additive Models (GAMs) (slides)
Validity (slides)
Discussion (slides)
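
To give a flavor of the significance part of the outline, here is a minimal sketch of a generalized likelihood ratio test (GLRT) on two nested Gaussian models. It is a deliberate simplification and not the tutorial's own code: it uses fixed effects only, whereas the LMEM-based analysis covered in the tutorial also models random effects. The function names and the simulated scores are hypothetical, and the chi-square p-value is an asymptotic approximation that is rough for a small number of training runs.

```python
import math
import random


def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood, using the ML variance estimate."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def glrt_system_effect(scores_a, scores_b):
    """GLRT for a system effect.

    Null model: one shared mean for all scores.
    Alternative model: a separate mean per system.
    Returns the likelihood ratio statistic and an asymptotic p-value.
    """
    pooled = list(scores_a) + list(scores_b)
    grand_mean = sum(pooled) / len(pooled)
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)

    ll_null = gaussian_loglik([y - grand_mean for y in pooled])
    ll_alt = gaussian_loglik(
        [y - mean_a for y in scores_a] + [y - mean_b for y in scores_b]
    )

    # The models differ by one parameter, so the statistic is compared
    # against a chi-square distribution with one degree of freedom.
    lr = 2.0 * (ll_alt - ll_null)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value


# Hypothetical evaluation scores from 10 training runs per system,
# each run with a different random seed (simulated, not real results).
rng = random.Random(0)
baseline = [rng.gauss(0.70, 0.02) for _ in range(10)]
candidate = [rng.gauss(0.72, 0.02) for _ in range(10)]

lr, p = glrt_system_effect(baseline, candidate)
print(f"LR statistic = {lr:.3f}, p = {p:.4f}")
```

The point of the sketch is that the comparison runs over scores from many replicated training runs rather than over a single best model, so the variation induced by randomness in training enters the test directly.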

Slides

All slides & references in one pdf (download)

Code & Data

GitHub repository with Python code for conducting an inferential analysis, plus example data: Code & Data. The inferential analysis examples are in the folder inferential_reproducibility.

Presenters

Prof. Dr. Stefan Riezler

Professor, Department of Computational Linguistics, Heidelberg University, Germany

Michael Hagmann, Dr. phil.

Research Assistant, Department of Computational Linguistics, Heidelberg University, Germany

Literature

Stefan Riezler and Michael Hagmann

Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science - Second Edition

Synthesis Lectures on Human Language Technologies, Springer, 2024

link | bib

@book{riezler2024,
  author = {Riezler, Stefan and Hagmann, Michael},
  title = {Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science - Second Edition},
  edition = {Second},
  publisher = {Springer},
  series = {Synthesis Lectures on Human Language Technologies},
  editor = {Hirst, Graeme},
  year = {2024},
  isbn = {978-3-031-57064-3},
  doi = {10.1007/978-3-031-57065-0},
  url = {https://doi.org/10.1007/978-3-031-57065-0}
}

Michael Hagmann, Philipp Meier and Stefan Riezler

Towards Inferential Reproducibility of Machine Learning Research

The Eleventh International Conference on Learning Representations, 2023

pdf | bib

@inproceedings{hagmann2023towards,
  title = {Towards Inferential Reproducibility of Machine Learning Research},
  author = {Hagmann, Michael and Meier, Philipp and Riezler, Stefan},
  booktitle = {The Eleventh International Conference on Learning Representations},
  year = {2023},
  url = {https://arxiv.org/abs/2302.04054}
}