
Resources

  • 2008 CoNLL Shared Task Data
    The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank) and NomBank v 1.0 LDC2008T23.
  • ACE 2005 English SpatialML Annotations
    ACE 2005 English SpatialML Annotations applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus.
  • ACE-2
    The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data.
  • ACL Anthology Reference Corpus
    The ACL Anthology Reference Corpus is a corpus of scholarly publications about Computational Linguistics. This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles.
  • AOL-DATA
    AOL query logs.
  • AQUAINT-2
    AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for the AQUAINT 2007 Question-Answer (QA) track of NIST (National Institute of Standards and Technology). It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).
  • American National Corpus
    The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
  • British Academic Written English Corpus
    The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented. The assignments have been annotated using a system devised in accordance with the TEI guidelines.
  • BBN Pronoun Coreference and Entity Type Corpus
    This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types.
  • BLLIP NANC Treebank
    The BLLIP NANC corpus contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).
  • British National Corpus
    The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
  • C&C Corpora
    These are the corpora distributed with the C&C tools. Please refer to the documentation of the latter for more information.
  • CCGBank
    CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure.
  • CELEX2
    This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
  • CORPS
    CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
  • Chinese Proposition Bank
    Chinese Proposition Bank 2.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of Chinese text annotated with information about basic semantic propositions.
  • Chinese Treebank
    The Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 780 thousand words (over 1.28 million Chinese characters).
  • CoNLL SRL
    This is the 20050314 release of the data and associated software for the CoNLL-2005 shared task. The shared task of CoNLL-2005 concerns the recognition of semantic roles, for the English language.
  • Datasets for Generic Relation Extraction (reACE)
    Datasets for Generic Relation Extraction (reACE) consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
  • Dependency-parsed British National Corpus
    The BNC parsed with the Clark and Curran Dependency Parser.
  • Enron News Corpus
    It contains data from about 150 users, mostly senior management of Enron, organized into folders.
  • FrameNet
    The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
  • English Gigaword 5th Edition
    The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the fifth edition of the English Gigaword Corpus.
  • HCRC Map Task Corpus
    The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
  • Heise-Newsticker Meldungen
    News items that appeared on the heise-ticker, a German platform for IT news.
  • MUC 6
    This corpus contains the annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC 6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).
  • Manually Annotated Sub-Corpus First Release
  • NEGRA
    10,000 sentences from the German newspaper "Frankfurter Rundschau", annotated with parts of speech and syntactic structures.
  • NomBank
    NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Our goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
  • North American News Text, Complete
    The NANC is a collection of English news text from the Los Angeles Times, Washington Post, New York Times, Reuters and the Wall Street Journal.
  • Open for Questions Corpus
    This corpus consists of 11 files corresponding to the 11 categories of the "Open for Questions" event on whitehouse.gov in March of 2009. In the course of this event, Americans submitted over 100,000 questions which they wanted President Obama to answer. Each file contains close to 1,000 questions from the respective category extracted from the "Open for Questions" page.
  • PAN Plagiarism Corpus
    This corpus contains documents in which plagiarism has been inserted automatically and manually.
  • Penn Discourse Treebank
    The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
  • Penn Discourse Treebank Version 2.0 Update - RTE data
    Recognizing Textual Entailment (RTE) update for the Penn Discourse Treebank 2.0
  • Penn Treebank
    The Penn Treebank (PTB) project selected 2,499 stories from a three year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. We have also added the dependency-converted version in CoNLL format.
  • Projekt Gutenberg
    Projekt Gutenberg collects texts that are in the public domain. This collection contains works by almost 400 different authors, all in German and formatted as HTML.
  • SALSA
    The data provided by this SALSA release add a layer of role-semantic information to TIGER (release 1), a syntactically annotated German newspaper corpus.
  • SMS Corpus
    This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
  • SemCor
    The SemCor corpus, created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. All words are tagged by PoS, and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet.
    Available version: 2.1
  • SemEval 2010 Task 10: Linking Events and their Participants in Discourse
    This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
  • SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
  • SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
    This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
  • SemEval-2013 Task 3: Spatial Role Labeling
  • Senseval 3 -- Task 6 (English Lexical Sample)
    The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. By the time Senseval-3 takes place, we expect to have enough data for about 60 ambiguous nouns, adjectives, and verbs.
  • Susanne
    The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
  • Szeged Corpus
    Szeged Corpus 2.0, the extension of the first version of the corpus, is a morpho-syntactically analyzed and manually annotated natural language database. It is not only bigger than the first version: apart from the contextually selected morpho-syntactic codes, the database also contains the possible codes, so that it can be used efficiently for testing automatic grammatical category annotation methods. The corpus consists of 1.2 million word entries, which cover 155,500 different word forms, and also contains a further 250 thousand punctuation marks. Corpus files are available in XML format; their inner structure is described by the TEIxLite DTD (Document Type Definition) scheme.
  • TERN
    This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program.
  • TIGER
    The TIGER Treebank is a corpus of 40,000 syntactically annotated German newspaper sentences. The annotation scheme used is an extended and improved version of the NEGRA annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version which was converted with the pennconverter (default setting; directory dependency-converted), but you will probably want to use the CoNLL06 data.
  • The New York Times Annotated Corpus
    The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
  • The Tübingen Treebank of Written German
    The TüBa-D/Z treebank is a syntactically annotated, German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
    Available versions: 5, 4, 3
  • TimeBank
    TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships.
    Available version: 1.1
  • UMD Death Penalty Corpus
    The Death Penalty Corpus is a collection of material from Web sites that express views for and against the death penalty.
  • Unified Linguistic Annotation Text Collection
    FactBank 1.0 consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they correspond to those events.
  • VICO Social Media Forum-Korpus
    100,000 posts each on the topics of health and PC (applications) from various German-language web forums, including metadata (thread, posting date, ...).
  • ukWaC
    The UK Web Archive contains websites that publish research, that reflect the diversity of lives, interests and activities throughout the UK, and demonstrate web innovation.
  • WaCkypedia
    A 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser).
  • deWac
    A 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.
  • frWac
    A 1.6 billion word corpus constructed from the Web by limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds.
  • pukWaC
    The same as ukWaC, a 2 billion word corpus acquired from the .uk domain, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the MaltParser.
  • Web 1t 5-gram Corpus
    This data set contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. Data collection took place in January 2006. This means that no text that was created on or after February 1, 2006 was used.
  • Leipzig Corpora Collection / Wortschatz
    The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed.
  • Yahoo! Answers Comprehensive Questions and Answers
    Yahoo Webscope Dataset L6
  • Yahoo! Answers Manner Questions
    Yahoo Webscope Dataset L5
  • Yahoo! Answers Question Types
    Yahoo Webscope Dataset L7
  • Yahoo! Answers Search Query Logs for Nine Languages
    Yahoo Webscope Dataset L8
  • Yahoo! Learning to Rank Challenge
    Yahoo Webscope Dataset L14
  • Die Zeit online
    News articles that appeared on the website of "Die Zeit", a German weekly magazine. The articles (1999-2001) have been retrieved from the webpage.
  • sdewac
    A 0.88 billion word corpus derived from deWaC; duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate).