Books & Tutorials

Validity, Reliability, and Significance

Monograph on Empirical Methods for NLP and Data Science

Validity, Reliability, and Significance

Tutorial on Statistical Methods for Reproducible Machine Learning

Software

Open-Source Software Projects hosted by our group:

Contrastive Markings

Code for experiments combining postedits and online markings, from the EAMT 2023 paper, Enhancing Supervised Learning...

Joey NMT

Minimalist NMT for educational purposes.

sparse_szo

Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order Optimization.

QUETCH

Quality estimation for machine translation.

cclir

A cross-language information retrieval (CLIR) toolbox based on the cdec decoder; the code package used in Bag-of-words Fo...

rebol

A toolkit for grounded learning for statistical machine translation, as described in the ACL 2014 paper, Response-Bas...

dtrain

A tuning method implemented for the cdec decoder; see Joint Feature Selection in Distributed Stochastic Learning for ...

otedama

Preordering for Machine Translation.

semparse

A semantic parser that treats the task as a monolingual SMT problem. The underlying SMT framework is the cdec decoder.

Contributions by our Group to other Open-Source Software Projects:

nematus

A toolkit for neural machine translation.

Neural Monkey

An open-source tool for sequence learning in NLP.

Corpora

BoostCLIR

A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance j...

DeCOCO

German translations for 1000 image captions from the COCO dataset.

HumanMT

Human ratings and corrections for translations from German to English and vice versa.

LibriVoxDeEn

A corpus for German-to-English Speech Translation and Speech Recognition.

map2seq

A dataset consisting of 7,672 Natural Language Landmark Navigation Instructions and corresponding route paths in Open...

MetaCLIR

Meta-textual information for BoostCLIR and the Large Scale CLIR Dataset (wiki-clir).

NFCorpus

A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.

NLmaps

A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Reada...

PatTR

A parallel patent corpus for statistical machine translation featuring three language pairs, German-English (23M sent...

SepsisExp

SepsisExp is a dataset consisting of timelines of patient health data with sepsis labels assigned by senior physicians.

WikiCaps

A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimed...

WikiCLIR

A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.