Resources

  • CELEX2
    This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
  • CoNLL NER
    This is the 20030423 release of the data for the CoNLL-2003 shared task. The CoNLL-2003 shared task deals with Language-Independent Named Entity Recognition. Specifically, the two languages considered are English and German.
  • Europarl
    This is a parallel corpus that was extracted from the European Parliament web site by Philipp Koehn (USC/ISI). It is fairly large, at 25-30 million words per language pair, and is mainly intended to aid statistical machine translation research.
  • Heise-Newsticker Meldungen
    News items that appeared on the heise-ticker, a German platform for IT news.
  • NEGRA
    10,000 sentences from the German newspaper "Frankfurter Rundschau", annotated with parts of speech and syntactic structures.
  • Projekt Gutenberg
    Projekt Gutenberg collects texts that are in the public domain. This collection contains works by almost 400 different authors, all in German and formatted as HTML.
  • Reuters Corpus
    A collection of Reuters newswire texts, sorted by month.
  • SALSA
    The data provided by this SALSA release add a layer of role-semantic information to TIGER (release 1), a syntactically annotated German newspaper corpus.
  • SMULTRON
    SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics at Stockholm University. The parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned at the sentence, phrase and word levels. Additionally, the German and Swedish monolingual treebanks contain lemma information.
    Versions: 1.0
  • SemEval 2010 Task 1: Coreference Resolution in Multiple Languages
    The task is concerned with intra-document coreference resolution for six different languages: Catalan, Dutch, English, German, Italian and Spanish. The core of the task is to identify which noun phrases (NPs) in a text refer to the same discourse entity.
  • TIGER
    The TIGER Treebank is a corpus of 40,000 syntactically annotated German newspaper sentences. The annotation scheme is an extended and improved version of the NEGRA annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version converted with the pennconverter (default setting; directory dependency-converted), but you will probably want to use the CoNLL06 data.
  • The Tübingen Treebank of Written German
    The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
    Versions: 5, 4, 3
  • VICO Social Media Forum-Korpus
    100,000 posts each on the topics of health and PC (applications) from various German-language web forums, including metadata (thread, posting date, ...)
  • Leipzig Corpora Collection / Wortschatz
    The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign-language material were removed.