-
2005 NIST Speaker Recognition Evaluation Training Data
2005 NIST Speaker Recognition Evaluation Training Data consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE).
-
2006 NIST Spoken Term Detection Development Set
2006 NIST Spoken Term Detection Development Set contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation.
-
2008 CoNLL Shared Task Data
The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank) and NomBank v 1.0 LDC2008T23.
-
ACE 2005 English SpatialML Annotations
ACE 2005 English SpatialML Annotations applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus.
-
ACL Anthology Reference Corpus
The ACL Anthology Reference Corpus is a corpus of scholarly publications about Computational Linguistics. This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles.
-
AOL-DATA
-
AQUAINT-2
AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for the AQUAINT 2007 Question-Answer (QA) track run by NIST (National Institute of Standards and Technology). It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).
-
American National Corpus
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
-
BBN Pronoun Coreference and Entity Type Corpus
This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types.
-
British National Corpus
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
-
C&C Corpora
These are the corpora distributed with the C&C tools. Please refer to the documentation of the latter for more information.
-
CCGBank
CCGbank is a translation of the Penn Treebank into a corpus of Combinatory
Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word
dependencies which approximate the underlying predicate-argument structure.
-
CELEX2
This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
-
CORPS
CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
-
CoNLL 2011 ST data set
The CoNLL 2011 Shared Task data set uses a subset of the OntoNotes-4.0
English corpus.
-
CoNLL 2012 ST data set
The CoNLL 2012 Shared Task data set uses a subset of the OntoNotes-5.0
corpus.
-
CoNLL NER
This is the 20030423 release of the data for the CoNLL-2003 shared
task.
The CoNLL-2003 shared task deals with Language-Independent Named
Entity Recognition. Specifically, the two languages considered are English
and German.
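As a rough illustration of how such data is typically read, the sketch below assumes the commonly distributed layout: one token per line with whitespace-separated columns, the entity tag in the last column, blank lines between sentences and `-DOCSTART-` lines between documents. The function name `read_conll_ner` and the sample lines are illustrative assumptions, not part of the release.

```python
def read_conll_ner(lines):
    """Group (word, entity_tag) pairs into sentences.

    Assumes the entity tag is the last whitespace-separated column,
    which lets the same reader cope with a different number of columns.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in the assumed layout (word, POS, chunk, entity tag).
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll_ner(sample)
```

Taking the tag from the last column rather than a fixed index is a design choice that keeps the reader usable if a language's files carry an extra column.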
-
CoNLL SRL
This is the 20050314 release of the data and associated software for
the CoNLL-2005 shared task. The CoNLL-2005 shared task concerns the recognition of semantic roles for the English language.
-
Datasets for Generic Relation Extraction (reACE)
Datasets for Generic Relation Extraction (reACE) consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
-
Dependency-parsed British National Corpus
The BNC parsed with the Clark and Curran Dependency Parser.
-
Enron News Corpus
It contains data from about 150 users, mostly senior management of Enron, organized into folders.
-
Europarl
This is a parallel corpus that was extracted from the European Parliament web site by Philipp Koehn (USC/ISI). It is fairly big, 25-30 million words per language pair, and its main intended use is to aid statistical machine translation research.
-
FrameNet
The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
-
HCRC Map Task Corpus
The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
-
MUC 6
This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events. It was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.
-
Manually Annotated Sub-Corpus First Release
-
NomBank
NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Our goal is to mark the sets of arguments that co-occur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
-
OntoNotes
The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
-
PAN Plagiarism Corpus
This corpus contains documents in which plagiarism has been inserted automatically and manually.
-
Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) is an NSF-funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
-
Penn Discourse Treebank Version 2.0 Update - RTE data
Recognizing Textual Entailment (RTE) update for the Penn Discourse Treebank 2.0
-
Penn Treebank
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. We have also added the dependency-converted version in CoNLL format.
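The entry does not say which CoNLL layout the dependency-converted version uses. As a minimal sketch, the example below assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the `Token` type, the `parse_sentence` helper and the sample sentence with its relation labels are illustrative assumptions.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence (ID column)
    form: str     # surface word form (FORM column)
    head: int     # index of the governing token, 0 for the root (HEAD column)
    deprel: str   # dependency relation to the head (DEPREL column)


def parse_sentence(block):
    """Parse one sentence given as tab-separated lines in the assumed layout."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens


# Illustrative sentence; the labels are made up for the example.
sample = (
    "1\tStocks\t_\tNNS\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tfell\t_\tVBD\tVBD\t_\t0\tROOT\t_\t_\n"
    "3\tsharply\t_\tRB\tRB\t_\t2\tADV\t_\t_\n"
)

tokens = parse_sentence(sample)
# The root is the token whose head index is 0.
root = next(t for t in tokens if t.head == 0)
```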
-
Reuters Corpus
A collection of Reuters newswire texts, sorted by months.
-
SMULTRON
SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
-
SemCor
The SemCor corpus, created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 running
words, all of which are tagged by PoS; more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet.
-
SemEval 2010 Task 10: Linking Events and their Participants in Discourse
This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
-
SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
-
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
-
SemEval-2013 Task 3: Spatial Role Labeling
-
Senseval 3 -- Task 6 (English Lexical Sample)
The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. By the time Senseval-3 takes place, we expect to have enough data for about 60 ambiguous nouns, adjectives, and verbs.
-
Susanne
The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
-
TERN
This release contains the English training data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by
the Automatic Content Extraction (ACE) program.
-
The New York Times Annotated Corpus
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
-
TimeBank
TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships.
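As a hedged illustration of these tag types, the sketch below collects the elements from a small inline-annotated fragment using Python's standard XML parser. The fragment is invented for the example, the attribute names are illustrative rather than taken from the corpus files, and the link type is represented by a TLINK element, one of the link tags defined in TimeML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented fragment; attribute names and values are illustrative assumptions.
sample = (
    '<TimeML>'
    'The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> its results '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE" value="1998-02-13">February 13, 1998</TIMEX3>.'
    '<TLINK lid="l1" eventID="e1" relatedToTime="t1" signalID="s1" relType="IS_INCLUDED"/>'
    '</TimeML>'
)

root = ET.fromstring(sample)

# Group every annotation element by tag name, keeping attributes and text.
by_tag = defaultdict(list)
for elem in root.iter():
    if elem.tag in {"TIMEX3", "EVENT", "SIGNAL", "TLINK"}:
        by_tag[elem.tag].append((dict(elem.attrib), elem.text))
```

The link element carries no text of its own; it relates the event, the temporal expression and the signal through their identifiers.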
-
UN Corpora
The corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that will be presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.
-
Unified Linguistic Annotation Text Collection
The Unified Linguistic Annotation (ULA) project seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference and opinion annotations. The Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11).
-
Yahoo! Answers Comprehensive Questions and Answers
Yahoo Webscope Dataset L6
-
Yahoo! Answers Manner Questions
Yahoo Webscope Dataset L5
-
Yahoo! Answers Question Types
Yahoo Webscope Dataset L7
-
Yahoo! Answers Search Query Logs for Nine Languages
Yahoo Webscope Dataset L8
-
Yahoo! Learning to Rank Challenge
Yahoo Webscope Dataset L14