Institut für Computerlinguistik

Resources & Corpora and Software

Resources & Corpora

Abstract Graphs and abstract paths for strongly typed knowledge graphs [LiMo project]
ACL word segmentation correction: Correction of OCR Word Segmentation Errors in Articles from the ACL Collection through Neural Machine Translation Methods [LiMo project]
BoostCLIR: A Japanese-English corpus of patent abstracts for patent prior art search, consisting of 100K queries and relevance judgements for 1.4M documents.
DeCOCO: German translations for 1000 image captions from the COCO dataset.
deModify: A Dataset for Analyzing Contextual Constraints on Modifier Deletion [LiMo project]
GER_SET: Situation Entity Type labelled corpus for German [LiMo project]
German Twitter Embeddings: A set of word2vec-style skip-gram embeddings trained on a Twitter corpus [LiMo project]
German Word Sense Annotation Data Set (Broscheit et al. 2010)
GigaPair documents, alignments and automatically induced implicit arguments (Roth and Frank, 2012a,b; Roth and Frank, 2013; Roth and Frank, 2015)
GNVN: Two datasets for joint semantic predicate (GermaNet) and semantic role (VerbNet-style) annotation for German (Mújdricza-Maydt et al. 2016)
HeiNER: The Heidelberg Named Entity Resource (Wentland et al., 2008)
HumanMT: Human pairwise and five-point ratings for 1000 translations from German to English.
IKAT: Implied knowledge annotations in argumentative texts (NLDB 2017) [LiMo project]
Lexicon of Abusive Words [LiMo project]
MSC: Modal sense classification dataset (Zhou et al., 2015, Marasovic et al. 2016, Frank and Marasovic 2016)
NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval, extracted from NutritionFacts.org.
NLmaps: A corpus for question-answering, consisting of 2,380 questions in English and German with corresponding Machine Readable Language (MRL) formulae, using the OpenStreetMap database.
ON5V: OntoNotes 5 Predicates Non-local Role Linking Data Set (Moor et al., 2013)
PatTR: A parallel patent corpus for statistical machine translation featuring three language pairs: German-English (23M sentence pairs), English-French (19M sentence pairs) and French-German (5M sentence pairs).
SightSee: A Frame-annotated Wayfinding Corpus
SR3de: Semantic Role Triple Dataset for German, a dataset with parallel PropBank-, VerbNet-, and FrameNet-style semantic role annotation on approx. 3000 instances of the CoNLL 2009 shared task German data (Hartmann et al. 2017)
Twitter Titling Corpus: A Stance-annotated Corpus of tweets mentioning presidents
WikiCaps: A large-scale multilingual data set of image-caption pairs for multimodal machine translation, extracted from Wikimedia Commons.
WikiCLIR: A large-scale German-English retrieval data set for Cross-Language Information Retrieval, extracted from Wikipedia.

Software

cclir: A cross-language information retrieval (CLIR) toolbox based on the cdec decoder, code package used in Bag-of-words Forced Decoding for Cross-Lingual Information Retrieval (Hieber and Riezler, ACL 2015), inter alia.
convert: A Python script that converts function-head style encodings in dependency treebanks into a content-head style encoding (as used in the UD treebanks) and vice versa, for adpositions, copula and coordination (Rehbein, Steen, Do & Frank 2017) [LiMo project]
dtrain: A tuning method implemented for the cdec decoder, see Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT (Simianer, Riezler and Dyer, ACL 2012).
Joey NMT: Minimalist NMT for educational purposes
MACE-AL: A method for detecting annotation noise in automatically annotated data using Bayesian Inference and Active Learning [LiMo project]
MACE-AL-TREE: A method for detecting annotation noise in manually and automatically annotated treebanks using Bayesian Inference and Active Learning [LiMo project]
MMAX Extension for Word Sense Annotation (contact: Thomas Bögel)
nematus: A toolkit for neural machine translation.
Neural Monkey: An open-source tool for sequence learning in NLP; a WMT 2017 shared task version is also available.
otedama: Preordering for Machine Translation.
QUETCH: Quality estimation for machine translation.
rebol: A toolkit for grounded learning for statistical machine translation, as described in the ACL 2014 paper, Response-Based Learning for Grounded Machine Translation (Riezler, Simianer and Haas).
semparse: A semantic parser that treats the task as a monolingual SMT problem. The underlying SMT framework is the cdec decoder.