Resources

  • (Baked) Strudel
    Strudel: A corpus-based semantic model based on properties and types.
  • 2005 NIST Speaker Recognition Evaluation Training Data
    2005 NIST Speaker Recognition Evaluation Training Data consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE).
  • 2006 NIST Spoken Term Detection Development Set
    2006 NIST Spoken Term Detection Development Set contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation.
  • 2008 CoNLL Shared Task Data
    The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 (LDC99T42), BBN Pronoun Coreference and Entity Type Corpus (LDC2005T33), Proposition Bank I (LDC2004T14, PropBank) and NomBank v 1.0 (LDC2008T23).
  • ACE 2005 English SpatialML Annotations
    ACE 2005 English SpatialML Annotations applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus.
  • ACE-2
    The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data.
  • ACL Anthology Reference Corpus
    The ACL Anthology Reference Corpus is a corpus of scholarly publications about Computational Linguistics. This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles.
  • AOL-DATA
    AOL query logs.
  • AQUAINT-2
    AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for the AQUAINT 2007 Question-Answer (QA) track of NIST (National Institute of Standards and Technology). It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).
  • Amazon Multi-Domain Sentiment Dataset
    The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Reviews contain star ratings (1 to 5 stars).
  • American National Corpus
    The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
  • British Academic Written English Corpus
    The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented. The assignments have been annotated using a system devised in accordance with the TEI guidelines.
  • BBN Pronoun Coreference and Entity Type Corpus
    This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types.
  • BLLIP NANC Treebank
    The BLLIP NANC corpus contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).
  • British National Corpus
    The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
  • C&C Corpora
    These are the corpora distributed with the C&C tools. Please refer to the documentation of the latter for more information.
  • CCGBank
    CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure.
  • CELEX2
    This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
  • CORPS
    CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
  • CQP
    The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.
  • Chinese Proposition Bank
    Chinese Proposition Bank 2.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of Chinese text annotated with information about basic semantic propositions.
  • Chinese Treebank
    The Chinese Treebank, started at University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 780 thousand words (over 1.28 Million Chinese characters).
  • CoNLL 2011 ST data set
    The CoNLL 2011 Shared Task data set uses a subset of the OntoNotes-4.0 English corpus.
  • CoNLL 2012 ST data set
    The CoNLL 2012 Shared Task data set uses a subset of the OntoNotes-5.0 corpus.
  • CoNLL NER
    This is the 20030423 release of the data for the CoNLL-2003 shared task. The CoNLL-2003 shared task deals with Language-Independent Named Entity Recognition. Specifically, the two languages considered are English and German.
  • CoNLL SRL
    This is the 20050314 release of the data and associated software for the CoNLL-2005 shared task. The shared task of CoNLL-2005 concerns the recognition of semantic roles, for the English language.
  • Datasets for Generic Relation Extraction (reACE)
    Datasets for Generic Relation Extraction (reACE) consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
  • Dependency-parsed British National Corpus
    The BNC parsed with the Clark and Curran Dependency Parser
  • Enron News Corpus
    It contains data from about 150 users, mostly senior management of Enron, organized into folders.
  • Europarl
    This is a parallel corpus that was extracted from the European Parliament web site by Philipp Koehn (USC/ISI). It is fairly big, 25-30 million words per language pair, and its main intended use is to aid statistical machine translation research.
  • FrameNet
    The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
  • GALE Phase 1 Arabic Blog Parallel Text
    Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program. (LDC2008T02)
  • English Gigaword 5th Edition
    The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the fifth edition of the English Gigaword Corpus.
  • HCRC Map Task Corpus
    The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
  • Heise-Newsticker Meldungen
    News items published on the heise newsticker, a German platform for IT news.
  • JRC-Acquis
    The JRC-Acquis Multilingual Parallel Corpus is the total body of EU law applicable in the member states. Contains 22 different languages.
  • MSLR
    Microsoft Learning to Rank
  • MUC 6
    This corpus contains the annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC 6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).
  • Manually Annotated Sub-Corpus First Release
  • MonaSearch
    MonaSearch is a powerful query tool for linguistic treebanks.
    Versions: 0.3
  • NEGRA
    10,000 sentences from the German newspaper "Frankfurter Rundschau", annotated with parts of speech and syntactic structures.
  • NXT
    NXT is a set of libraries and tools that provide for the native representation, manipulation, query and analysis of multimedia language data.
  • NomBank
    NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Our goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
  • North American News Text, Complete
    The NANC is a collection of English news text from the Los Angeles Times, Washington Post, New York Times, Reuters and the Wall Street Journal.
  • OntoNotes
    The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
    Versions: 3.0, 2.0
  • Open for Questions Corpus
    This corpus consists of 11 files corresponding to the 11 categories of the "Open for Questions" event on whitehouse.gov in March of 2009. In the course of this event, Americans submitted over 100,000 questions which they wanted President Obama to answer. Each file contains close to 1,000 questions from the respective category extracted from the "Open for Questions" page.
  • OpenSubtitles
    This is a collection of movie subtitles in various languages, tokenized and aligned at the sentence level.
    Versions: 0.7, 0.3
  • PAN Plagiarism Corpus
    This corpus contains documents in which plagiarism has been inserted automatically and manually.
  • Penn Discourse Treebank
    The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
  • Penn Discourse Treebank Version 2.0 Update - RTE data
    Recognizing Textual Entailment (RTE) update for the Penn Discourse Treebank 2.0
  • Penn Treebank
    The Penn Treebank (PTB) project selected 2,499 stories from a three year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. We have also added the dependency-converted version in CoNLL format.
  • Precompiled Personalized PageRank vectors for all WordNet lemmas
    This is a collection of files which contain the probability vectors for all lemmas in WordNet version 3.0. The vectors have been produced by the ukb_ppv program (http://ixa.si.ehu.es/ukb).
  • Projekt Gutenberg
    Project Gutenberg collects texts that are in the public domain. This collection contains works by almost 400 different authors. All of them are in German and formatted as HTML.
  • The Regensburg Parallel Corpus (German - Russian)
    The RPC is a parallel aligned corpus of translated and original belletristic texts in Slavic and some other languages, developed at the Institute of Slavistics at Regensburg University.
  • Reuters Corpus
    A collection of Reuters newswire texts, sorted by months.
  • SALSA
    The data provided by this SALSA release add a layer of role-semantic information to TIGER (release 1), a syntactically annotated German newspaper corpus.
  • SMS Corpus
    This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
  • SMULTRON
    SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
    Versions: 1.0
  • SemCor
    The SemCor corpus, created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. All words are PoS-tagged, and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet.
    Versions: 2.1
  • SemEval 2010 Task 10: Linking Events and their Participants in Discourse
    This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
    Versions: 2010, 1.0
  • SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
  • SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
    This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
  • SemEval-2013 Task 3: Spatial Role Labeling
  • Senseval 3 -- Task 6 (English Lexical Sample)
    The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. By the time Senseval-3 takes place, we expect to have enough data for about 60 ambiguous nouns, adjectives, and verbs.
  • Susanne
    The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
  • Szeged Corpus
    Szeged Corpus 2.0, the extension of the first version of the corpus, is a morpho-syntactically analyzed and manually annotated natural language database. It is not only bigger than the first version: apart from the contextually selected morpho-syntactic codes, the database also contains all possible codes, so that it can be used efficiently for testing automatic grammatical category annotation methods. The corpus consists of 1.2 million word entries, which cover 155,500 different word forms, and also contains a further 250 thousand punctuation marks. Corpus files are available in XML format; their inner structure is described by the TEIxLite DTD (Document Type Definition) scheme.
    Versions: 2.0
  • TERN
    This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program.
  • TIGER
    The TIGER Treebank is a corpus of 40,000 syntactically annotated German newspaper sentences. The annotation scheme used is an extended and improved version of the NEGRA annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version which was converted with the pennconverter (default setting; directory dependency-converted), but you will probably want to use the CoNLL06 data.
  • TIGERSearch
    TIGERSearch is a specialized search engine for retrieving information from annotated corpora.
  • The New York Times Annotated Corpus
    The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
  • The Tübingen Treebank of Written German
    The TüBa-D/Z treebank is a syntactically annotated, German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
    Versions: 5, 4, 3
  • TimeBank
    TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships.
    Versions: 1.1
  • Twitter data set
    A slightly cleaned-up version of the Twitter data gathered through Twitter's streaming API (http://stream.twitter.com/). The data is released under the Creative Commons license.
  • TypeDM
    Distributional Memory: A general framework for corpus-based semantics
  • UMD Death Penalty Corpus
    The Death Penalty Corpus is a collection of material from Web sites that express views for and against the death penalty.
  • UN Corpora
    The corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that will be presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.
  • Unified Linguistic Annotation Text Collection
    The Unified Linguistic Annotation (ULA) project seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference and opinion annotations. The Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11).
    Versions: 1.0
  • The Universal Declaration of Human Rights
    The Universal Declaration of Human Rights in over 300 different languages. All declarations are taken from http://www.unhchr.ch/udhr/navigate/alpha.htm. They have been downloaded and converted by the script udhr-get.py.
  • VICO Social Media Forum-Korpus
    100,000 posts each on the topics of health and PC (applications) from various German-language web forums, including metadata (thread, posting date, ...).
  • ukWaC
    A 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain.
  • WaCkypedia
    A 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser).
  • deWac
    A 1.7 billion word corpus constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.
  • frWac
    A 1.6 billion word corpus constructed from the Web by limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds.
  • pukWaC
    The same as ukWaC, a 2 billion word corpus acquired from the .uk domain, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the MaltParser.
  • Web 1T 5-gram, 10 European Languages
    Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens.
  • Web 1T 5-gram Corpus
    This data set contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. Data collection took place in January 2006. This means that no text that was created on or after February 1, 2006 was used.
  • WikiXML
    WikiXML is a collection of Wikipedia articles converted to XML format.
  • Leipzig Corpora Collection / Wortschatz
    The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed.
  • Yahoo! Answers Comprehensive Questions and Answers
    Yahoo Webscope Dataset L6
  • Yahoo! Answers Manner Questions
    Yahoo Webscope Dataset L5
  • Yahoo! Answers Question Types
    Yahoo Webscope Dataset L7
  • Yahoo! Answers Search Query Logs for Nine Languages
    Yahoo Webscope Dataset L8
  • Yahoo! Learning to Rank Challenge
    Yahoo Webscope Dataset L14
  • Die Zeit online
    News articles that appeared on the website of "Die Zeit", a German weekly magazine. The articles (1999-2001) were retrieved from the website.
  • musiXmatch dataset
    The MXM dataset provides lyrics for many MSD tracks. The lyrics come in bag-of-words format: each track is described as the word-counts for a dictionary of the top 5,000 words across the set.
  • sdewac
    A 0.88 billion word corpus derived from deWaC; duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate).
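
Several entries above describe count-based formats, for example the musiXmatch dataset, where each track is given as word counts over a shared dictionary of the most frequent words. The sketch below illustrates that general idea on made-up toy data. It is not the actual MXM file format or tooling; the function names, the track identifiers and the dictionary size of 3 are assumptions chosen for illustration only.

```python
from collections import Counter


def build_vocabulary(tracks, size):
    """Return the `size` most frequent words across all tracks."""
    totals = Counter()
    for words in tracks.values():
        totals.update(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def to_bag_of_words(words, vocabulary):
    """Count only the words that belong to the shared vocabulary."""
    allowed = set(vocabulary)
    return dict(Counter(w for w in words if w in allowed))


# Toy, already tokenized lyrics (hypothetical data, not taken from MXM).
tracks = {
    "track_a": ["love", "you", "love", "me"],
    "track_b": ["you", "and", "me", "you"],
}

vocabulary = build_vocabulary(tracks, size=3)
bags = {tid: to_bag_of_words(words, vocabulary) for tid, words in tracks.items()}

for track_id, bag in bags.items():
    print(track_id, bag)
```

Words outside the shared dictionary (here "and") are dropped, which is why a bag-of-words representation cannot be used to reconstruct the original text.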