Resources

Books & Tutorials

Validity, Reliability, and Significance

Monograph on Empirical Methods for NLP and Data Science

Machine Learning

Tutorial

Tutorial on Statistical Methods for Reproducible Machine Learning

Machine Learning

Software

Open-Source Software Projects hosted by our group:

Contrastive Markings

Code for experiments combining postedits and online markings, from the EAMT 2023 paper, Enhancing Supervised Learning...

Machine Translation

Joey NMT

Minimalist NMT for educational purposes.

Machine Translation, Speech Recognition

sparse_szo

Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order Optimization.

Optimization

QUETCH

Quality estimation for machine translation.

Machine Translation

cclir

A cross-language information retrieval (CLIR) toolbox based on the cdec decoder, code package used in Bag-of-words Fo...

Information Retrieval

rebol

A toolkit for grounded learning for statistical machine translation, as described in the ACL 2014 paper, Response-Bas...

Machine Translation

dtrain

A tuning method implemented for the cdec decoder; see Joint Feature Selection in Distributed Stochastic Learning for ...

Machine Translation

otedama

Preordering for Machine Translation.

Machine Translation

semparse

A semantic parser that treats the task as a monolingual SMT problem. The underlying SMT framework is the cdec decoder.

Machine Translation

Contributions by our Group to other Open-Source Software Projects:

nematus

A toolkit for neural machine translation.

Machine Translation

Neural Monkey

An open-source tool for sequence learning in NLP.

Machine Translation

Corpora

BoostCLIR

A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance j...

Patent

DeCOCO

German translations for 1000 image captions from the COCO dataset.

Image Caption

HumanMT

Human ratings and corrections for translations from German to English and vice versa.

Machine Translation

LibriVoxDeEn

A corpus for German-to-English Speech Translation and Speech Recognition.

Speech Recognition

map2seq

A dataset consisting of 7,672 Natural Language Landmark Navigation Instructions and corresponding route paths in Open...

Landmark Navigation Instructions

MetaCLIR

Meta-textual information for BoostCLIR and the Large Scale CLIR Dataset (wiki-clir).

Information Retrieval

NFCorpus

A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.

Medical

NLmaps

A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Reada...

Question Answering

PatTR

A parallel patent corpus for statistical machine translation featuring three language pairs, German-English (23M sent...

Patent

SepsisExp

SepsisExp is a dataset consisting of timelines of patient health data with sepsis labels assigned by senior physicians.

Medical

WikiCaps

A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimed...

Image Caption

WikiCLIR

A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.

Information Retrieval