Resources

  • 2005 NIST Speaker Recognition Evaluation Training Data
    2005 NIST Speaker Recognition Evaluation Training Data consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE).
  • Europarl
    This is a parallel corpus extracted from the European Parliament web site by Philipp Koehn (USC/ISI). It is fairly large, at 25-30 million words per language pair, and is intended mainly to support statistical machine translation research.
  • Reuters Corpus
    A collection of Reuters newswire texts, sorted by month.
  • SemEval 2010 Task 1: Coreference Resolution in Multiple Languages
    The task is concerned with intra-document coreference resolution for six different languages: Catalan, Dutch, English, German, Italian and Spanish. The core of the task is to identify which noun phrases (NPs) in a text refer to the same discourse entity.
  • UN Corpora
    The corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that will be presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.