Validity, Reliability, and Significance: A Tutorial on Statistical Methods for Reproducible Machine Learning



Scientific progress in machine learning is driven by empirical studies that evaluate the relative quality of models. The goal of such an evaluation is to compare machine learning methods themselves, not to reproduce single test-set evaluations of particular optimized instances of trained models. The practice of reporting performance scores of single best models is particularly inadequate for deep learning, because the performance of deep learning models depends strongly on various sources of randomness. This evaluation practice raises three methodological questions: whether a model predicts what it purports to predict (validity), whether a model’s performance is consistent across replications of the training process (reliability), and whether a performance difference between two models is due to chance (significance). The goal of this tutorial is to answer these questions with concrete statistical tests. The tutorial is hands-on and is accompanied by a textbook (Riezler and Hagmann, 2021) and a webpage that includes R and Python code.
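As a rough illustration of the significance question, the following is a minimal sketch of a paired sign-flip permutation test, one of the alternative tests named in the outline below. It is not the LMEM/GLRT approach that the tutorial focuses on. The function name `paired_permutation_test`, its parameters, and the example scores are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean score difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the system labels are exchangeable within
        # each pair, so every difference keeps or flips its sign with equal
        # probability.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical scores of two systems on the same eight test items.
system_a = [0.71, 0.64, 0.69, 0.73, 0.66, 0.70, 0.68, 0.72]
system_b = [0.65, 0.60, 0.66, 0.67, 0.61, 0.66, 0.63, 0.68]
p_value = paired_permutation_test(system_a, system_b)
print(f"estimated p-value: {p_value:.4f}")
```

A test of this kind compares two fixed sets of scores. It does not by itself account for variation across training replications or meta-parameter settings, which is the motivation the abstract gives for going beyond single best-model comparisons.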


  • Opening remarks
  • Introduction
    • The train-dev-test paradigm
    • Replicability and reproducibility
    • A new paradigm: Inferential reproducibility
  • Mathematical background: Linear Mixed Effects Models (LMEMs) and Generalized Likelihood Ratio Tests (GLRTs)
    • General form of LMEMs
    • Significance testing with GLRTs
    • How to fit LMEMs with R and Python
  • Significance
    • The nested models setup
    • Example: Significance testing for neural machine translation under variation in meta-parameters and data properties
    • Alternative tests: Bootstrap and permutation tests
  • Reliability
    • Variance component analysis and reliability coefficients
    • Example: Meta-parameter importance and reliability in interactive machine translation
    • Alternative reliability measures: Agreement measures and bootstrap confidence intervals
  • Break
  • Recap: A worked-through example
  • Q&A
  • Mathematical background: Generalized Additive Models (GAMs)
    • General form of GAMs
    • Splines
    • How to fit GAMs with R and Python
  • Validity
    • New concept: Circular features
    • A circularity test based on GAMs
    • Examples: Circularity in deep learning for patent information retrieval and medical data science
  • Q&A
  • Closing remarks and discussion


Prof. Dr. Stefan Riezler

Professor, Department of Computational Linguistics, Heidelberg University, Germany

Michael Hagmann, MSc.

Research Assistant, Department of Computational Linguistics, Heidelberg University, Germany

Tutorial Material

Coming soon …


  1. Stefan Riezler and Michael Hagmann
    Validity, Reliability, and Significance: Empirical Methods for NLP and Data Science
    Synthesis Lectures on Human Language Technologies (series editor: Graeme Hirst), Morgan & Claypool Publishers, 2022
    ISBN: 9781636392714. DOI: 10.2200/S01137ED1V01Y202110HLT055