-
2008 CoNLL Shared Task Data
The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development data and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations that include information such as named-entity boundaries, together with semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank) and NomBank v 1.0 LDC2008T23.
-
ACE 2005 English SpatialML Annotations
ACE 2005 English SpatialML Annotations applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in the ACE 2005 Multilingual Training Corpus.
-
ACE-2
The objective of the ACE program is to develop extraction technology to support
automatic processing of source language data (in the form of natural text, and as text derived
from ASR and OCR). This includes classification, filtering, and selection based on the language
content of the source data, i.e., based on the meaning conveyed by the data.
-
AOL-DATA
-
American National Corpus
The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
-
BBN Pronoun Coreference and Entity Type Corpus
This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types.
-
BLLIP NANC Treebank
The BLLIP NANC corpus contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).
-
British National Corpus
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language, drawn from a wide range of sources and designed to represent a broad cross-section of current British English.
-
C&C Corpora
These are the corpora distributed with the C&C tools. Please refer to the C&C tools documentation for more information.
-
CCGbank
CCGbank is a translation of the Penn Treebank into a corpus of Combinatory
Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word
dependencies which approximate the underlying predicate-argument structure.
-
CELEX2
This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
-
CORPS
CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
-
Chinese Proposition Bank
Chinese Proposition Bank 2.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of Chinese text annotated with information about basic semantic propositions.
-
Chinese Treebank
The Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 780 thousand words (over 1.28 million Chinese characters).
-
CoNLL SRL
This is the 20050314 release of the data and associated software for
the CoNLL-2005 shared task. The CoNLL-2005 shared task concerns the recognition of semantic roles for the English language.
-
Datasets for Generic Relation Extraction (reACE)
Datasets for Generic Relation Extraction (reACE) consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
-
Dependency-parsed British National Corpus
The British National Corpus (BNC), parsed with the Clark and Curran Dependency Parser.
-
FrameNet
The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
-
HCRC Map Task Corpus
The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
-
MUC 6
This corpus contains the annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC 6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).
-
Manually Annotated Sub-Corpus First Release
-
NEGRA
10,000 sentences from the German newspaper "Frankfurter Rundschau", annotated with parts of speech and syntactic structures.
-
NomBank
NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Its goal is to mark the sets of arguments that co-occur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
-
North American News Text, Complete
The NANC is a collection of English news text from the Los Angeles Times, Washington Post,
New York Times, Reuters and the Wall Street Journal.
-
PAN Plagiarism Corpus
This corpus contains documents in which plagiarism has been inserted automatically and manually.
-
Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
-
Penn Discourse Treebank Version 2.0 Update - RTE data
Recognizing Textual Entailment (RTE) update for the Penn Discourse Treebank 2.0
-
Penn Treebank
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. We have also added the dependency-converted version in CoNLL format.
-
SALSA
The data provided by this SALSA release add a layer of role-semantic information to TIGER (release 1), a syntactically annotated German newspaper corpus.
-
SemCor
The SemCor corpus, created at Princeton University, is a subset of the English Brown corpus containing almost 700,000 running
words. All words are tagged with their part of speech, and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet.
-
SemEval 2010 Task 10: Linking Events and their Participants in Discourse
This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
-
SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
-
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
-
SemEval-2013 Task 3: Spatial Role Labeling
-
Senseval 3 -- Task 6 (English Lexical Sample)
The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. By the time Senseval-3 takes place, we expect to have enough data for about 60 ambiguous nouns, adjectives, and verbs.
-
Susanne
The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
-
Szeged Corpus
Szeged Corpus 2.0, the extension of the first version of the corpus, is a morpho-syntactically analyzed and manually annotated natural language database. It is larger than the first version and, in addition to the contextually selected morpho-syntactic codes, it also contains the possible codes, which makes it well suited to testing automatic grammatical category annotation methods. The corpus consists of 1.2 million word entries covering 155,500 different word forms, plus a further 250 thousand punctuation marks. Corpus files are available in XML format; their internal structure is described by the TEIxLite DTD (Document Type Definition) scheme.
-
TERN
This release contains the English training data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by
the Automatic Content Extraction (ACE) program.
-
TIGER
The TIGER Treebank is a corpus of 40,000 syntactically annotated German
newspaper sentences. The annotation scheme is an extended and improved version of the NEGRA
annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version converted with the pennconverter (default settings; directory dependency-converted), but you will probably want to use the CoNLL06 data.
-
The New York Times Annotated Corpus
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
-
The Tübingen Treebank of Written German
The TüBa-D/Z treebank is a syntactically annotated, German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
-
TimeBank
TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships.
-
UMD Death Penalty Corpus
The Death Penalty Corpus is a collection of material from Web sites that
express views for and against the death penalty.
-
Unified Linguistic Annotation Text Collection
FactBank 1.0 consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to those events.
-
ukWaC
ukWaC is a 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain.
-
WaCkypedia
A 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full
dependency parse (parsing performed with the MaltParser).
-
deWaC
A 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.
-
frWaC
A 1.6 billion word corpus constructed from the Web by limiting the crawl to the .fr domain and using medium-frequency words
from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds.
-
pukWaC
The same as ukWaC, a 2 billion word corpus acquired from the .uk domain, but with a further layer of annotation added, i.e. a
full dependency parse. The parsing was performed with the MaltParser.
-
Yahoo! Answers Comprehensive Questions and Answers
Yahoo Webscope Dataset L6
-
Yahoo! Answers Manner Questions
Yahoo Webscope Dataset L5
-
Yahoo! Answers Question Types
Yahoo Webscope Dataset L7
-
Yahoo! Answers Search Query Logs for Nine Languages
Yahoo Webscope Dataset L8
-
Yahoo! Learning to Rank Challenge
Yahoo Webscope Dataset L14
-
SdeWaC
A 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate).