Welcome to the Department of Computational Linguistics' resources home page!

These pages document all the resources available at our department. If you are a member, please log in with the general coli account (top right) to access all the available information. If you have a technical problem, please look for an existing ticket concerning it or open a new one. If you have general questions, contact us at resources@cl.uni-heidelberg.de.



Resources

  • (Baked) Strudel
    Strudel: A corpus-based semantic model based on properties and types.
  • 2005 NIST Speaker Recognition Evaluation Training Data
    2005 NIST Speaker Recognition Evaluation Training Data consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE).
  • 2006 NIST Spoken Term Detection Development Set
    2006 NIST Spoken Term Detection Development Set contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation.
  • 2008 CoNLL Shared Task Data
    The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank) and NomBank v 1.0 LDC2008T23.
  • ACE 2005 English SpatialML Annotations
    ACE 2005 English SpatialML Annotations applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus.
  • ACE-2
    The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data.
  • ACL Anthology Reference Corpus
    The ACL Anthology Reference Corpus is a corpus of scholarly publications about Computational Linguistics. This corpus is a canonicalized subset of the ACL Anthology, up to February 2007, consisting of 10,921 articles.
  • AOL-DATA
    AOL query logs.
  • AQUAINT-2
    AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute for Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).
  • ASSERT
    ASSERT is an automatic statistical semantic role tagger that can annotate naturally occurring text with semantic arguments. When presented with a sentence, it performs a full syntactic analysis of the sentence, automatically identifies all the verb predicates in that sentence, extracts features for all constituents in the parse tree relative to the predicate, and identifies and tags the constituents with the appropriate semantic arguments.
    Available versions: 0.14beta
  • Alchemy
    Alchemy is a software package providing a series of algorithms for statistical relational learning and probabilistic logic inference, based on the Markov logic representation.
  • Amazon Multi-Domain Sentiment Dataset
    The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Reviews contain star ratings (1 to 5 stars).
  • American National Corpus
    The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward.
  • BART
    BART, the Beautiful/Baltimore Anaphora Resolution Toolkit, is a tool for fully automatic, machine-learning-based coreference annotation of written text.
  • British Academic Written English Corpus
    The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented. The assignments have been annotated using a system devised in accordance with the TEI guidelines.
  • BBN Pronoun Coreference and Entity Type Corpus
    This publication supplements the one million word Penn Treebank corpus of Wall Street Journal texts (LDC95T7). The corpus contains stand-off annotation of pronoun coreference, indicated by sentence and token numbers, as well as annotation of a variety of entity and numeric types.
  • BLLIP NANC Treebank
    The BLLIP NANC corpus contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).
  • Banjo
    Banjo is a software application and framework for structure learning of static and dynamic Bayesian networks.
    Available versions: 2.0.1
  • Berkeley Aligner
    The BerkeleyAligner is a software package that combines the innovations of recent work in unsupervised word alignment at Berkeley. This package is meant both as an alternative to the ubiquitous GIZA++ and as a test bed for new alignment ideas.
    Available versions: 2.0, 1.0
  • Berkeley Parser
    A version of the Berkeley Parser trained on TueBa-D/Z. Kept separate since the compiled grammar is not compatible with other ones, e.g. the ones from the original distribution under Google Code.
    Available versions: 1.1
  • Bohnet
    Dependency parser. See Bernd Bohnet. 2010. Top Accuracy and Fast Dependency Parsing is not a Contradiction. The 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, China.
  • Brill Tagger
    This is the original implementation of the Brill Tagger.
  • British National Corpus
    The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
  • Buckwalter Arabic Morphological Analyzer
    The Buckwalter Arabic Morphological Analyzer is used for POS-tagging Arabic text. The data consists primarily of three Arabic-English lexicon files: prefixes (299 entries), suffixes (618 entries), and stems (82,158 entries representing 38,600 lemmas).
  • C&C
    The C&C tools consist of a robust, wide-coverage CCG parser and a number of Maximum Entropy taggers, each of which can be run as a separate program, or combined in one go.
    Available versions: 1.0
  • CCGBank
    CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations. It pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure.
  • CDG
    Constraint dependency grammar parsing system
  • CELEX2
    This corpus contains ASCII versions of the CELEX lexical databases of English (Version 2.5), Dutch (Version 3.1) and German (Version 2.5).
  • CMU SLM Toolkit
    The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community.
  • CORPS
    CORPS is a corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
  • CQP
    The IMS Open Corpus Workbench (CWB) is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.
  • Chinese Proposition Bank
    Chinese Proposition Bank 2.0 is a continuation of the Chinese Proposition Bank project, which aims to create a corpus of Chinese text annotated with information about basic semantic propositions.
  • Chinese Treebank
    The Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 780 thousand words (over 1.28 million Chinese characters).
  • Cluto
    CLUTO is a family of computationally efficient and high-quality data clustering and cluster analysis programs and libraries that are well suited for low- and high-dimensional data sets.
  • CoNLL 2011 ST data set
    The CoNLL 2011 Shared Task data set uses a subset of the OntoNotes-4.0 English corpus.
  • CoNLL 2012 ST data set
    The CoNLL 2012 Shared Task data set uses a subset of the OntoNotes-5.0 corpus.
  • CoNLL NER
    This is the 20030423 release of the data for the CoNLL-2003 shared task. The CoNLL-2003 shared task deals with Language-Independent Named Entity Recognition. Specifically, the two languages considered are English and German.
  • CoNLL SRL
    This is the 20050314 release of the data and associated software for the CoNLL-2005 shared task. The shared task of CoNLL-2005 concerns the recognition of semantic roles, for the English language.
  • Collins Parser
    The Collins Parser is a statistical parser for English.
  • Concept Explorer
    Concept Explorer (ConExp) is a tool that implements the basic functionality needed for the study of and research on Formal Concept Analysis, including context editing, building concept lattices from contexts and performing attribute exploration.
  • ConceptNet
    ConceptNet is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.
    Available versions: 2009-06-15
  • CorScorer
    Perl package for scoring coreference resolution systems using different metrics. This scorer was used in the 2011 CoNLL Shared Task.
  • CrowdFlower
    CrowdFlower is a web interface for designing crowdsourcing tasks. Annotations/judgments can either be ordered from Amazon Mechanical Turk or similar services, or you can recruit the contributors yourself by providing them with a link to the task. Contact resources@cl... if you want to use our ICL account.
  • Distributional Memory
    Distributional representation of English words a la Baroni & Lenci
  • Dan Bikel's Parser
    The software is an extensible, parallel parsing engine that accommodates many different types of generative, statistical parsing models (including an emulation of Mike Collins's parsing model with equally good performance), and can easily be extended to new domains and new languages.
  • Datasets for Generic Relation Extraction (reACE)
    Datasets for Generic Relation Extraction (reACE) consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program to which the Edinburgh Regularized ACE (reACE) mark-up has been applied.
  • Dependency-parsed British National Corpus
    The BNC parsed with the Clark and Curran Dependency Parser
  • ECLiPSe
    The ECLiPSe Constraint Programming System
  • Enron News Corpus
    It contains data from about 150 users, mostly senior management of Enron, organized into folders.
  • Europarl
    This is a parallel corpus that was extracted from the European Parliament web site by Philipp Koehn (USC/ISI). It is fairly big, 25-30 million words per language pair, and its main intended use is to aid statistical machine translation research.
  • Extended WordNet
    The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.
  • Extracting syntactically constrained paraphrases
    Paraphrase extraction software and data used in Chris Callison-Burch's EMNLP-08 paper "Syntactic Constraints on Paraphrases Extracted from Parallel Corpora."
  • FrameNet
    The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
  • Freebase
    Freebase is an open, shared database that contains structured information on millions of topics in hundreds of categories. This information is compiled from open datasets like Wikipedia, MusicBrainz, the Securities and Exchange Commission, and the CIA World Fact Book, as well as contributions from our user community.
    Available versions: 29-06-10, 15-04-10
  • GADeL
    GADeL is a Genetic Algorithm for Default Logic implemented in Sicstus Prolog Objects.
  • GALE Phase 1 Arabic Blog Parallel Text
    Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program. (LDC2008T02)
  • GATE
    GATE is an infrastructure for developing and deploying software components that process human language.
    Available versions: 5.1
  • GIZA++
    GIZA++ is an extension of the program GIZA. It is a program for learning statistical translation models from bitext.
    Available versions: 1.02, 1.0 (part of EGYPT)
  • GWSD
    GWSD is a system for Unsupervised Graph-based All-Words Word Sense Disambiguation. Please refer to (Sinha and Mihalcea, 2007) for a description of the graph-based disambiguation method, as well as for brief descriptions of all the similarity measures and the graph-centrality algorithms used by GWSD.
  • Geobase
    Geobase demonstrates a natural language interface to a database on U.S. geography.
  • GermaNet
    GermaNet is a lexical-semantic net that has been developed within the LSD Project at the Division of Computational Linguistics of the Linguistics Department at the University of Tübingen.
    Available versions: 6.0, 5.2, 5.1, 5.0
  • GermaNet API in Java
    The GermaNet-API provides easy access to all information available in GermaNet, the German word net, for programs written in Java.
  • GermaNet Perl API
    The GermaNet-API provides easy access to all information available in GermaNet, the German word net, for programs written in Perl.
    Available versions: 1.2, 1.0
  • German Topological Parser
    This package contains parsing models trained on the TueBaD/Z corpus (specifically the version that was released for the ACL 2008 Parsing German workshop) for use with the Berkeley parser.
  • GibbsLDA++
    GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using the Gibbs sampling technique for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large-scale datasets including large collections of text/Web documents. Note: This one does not run on ella.
  • English Gigaword 5th Edition
    The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the fifth edition of the English Gigaword Corpus.
  • HCRC Map Task Corpus
    The HCRC Map Task Corpus is a set of 128 dialogues that has been recorded, transcribed, and annotated for a wide range of behaviours, and has been released for research purposes.
  • HILDA
    HILDA (HIgh-Level Discourse Analyzer) is a discourse parser: it analyzes a text and uncovers the underlying functional relations between its different parts. The text is annotated under a theory of text organization called Rhetorical Structure Theory.
  • HTML Parser
    HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.
    Available versions: 1.6
  • Hadoop
    Apache Hadoop Core is a software platform that lets one easily write and run applications that process vast amounts of data.
  • Heart of Gold
    The Heart of Gold is an XML-based middleware for the integration of deep and shallow natural language processing components. It provides a uniform and flexible infrastructure for building applications that use RMRS-based and/or XML-based natural language processing components.
  • Heise-Newsticker Meldungen
    News items that appeared on the Heise news ticker, a German platform for IT news.
  • IBM LanguageWare Resource Workbench
    IBM® LanguageWare® is a set of run-time libraries and an easy-to-use Eclipse-based development environment for building custom text analyzers in various languages. The tools make it easy to build dictionaries, ontologies, and rules for identifying key information, relationships and meaning.
    Available versions: 7.0.1
  • IceNLP
    IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
    Available versions: 1.2
  • Indri
    Indri search engine, part of the Lemur project
  • JGraphT
    JGraphT is a free Java graph library that provides mathematical graph-theory objects and algorithms. Although powerful, JGraphT is designed to be simple and type-safe (via Java generics).
    Available versions: 0.7.3
  • JNET
    The JULIE Lab Named Entity Tagger (JNET) is a generic and configurable multi-class named entity recognizer. JNET's comprehensive feature set allows JNET to be employed for most domains and entity classes.
  • JRC-Acquis
    The JRC-Acquis Multilingual Parallel Corpus is the total body of EU law applicable in the member states. Contains 22 different languages.
  • JSBD
    The JULIE Sentence Boundary Detector (JSBD) is an ML-based sentence splitter. It can be retrained on supported training material and is thus neither language nor domain dependent.
  • JULIE Token Boundary Detector (JTBD)
    The JULIE Lab Sentence Boundary Detector (JSBD) and the JULIE Lab Token Boundary Detector (JTBD) are machine learning-based tools, developed and optimized for handling life science documents containing many tricky cases which many other, especially rule-based tools, don't handle appropriately.
  • JWN-Similarity
    JWN-Similarity is a Java wrapper library written around Ted Pedersen et al.'s WordNet::Similarity library.
  • JWPL
    JWPL (Java Wikipedia Library) is a free, Java-based application programming interface that provides access to all information contained in Wikipedia.
    Available versions: 0.4, 0.33
  • Java FrameNet API
    The FrameNet API can be used to access the FrameNet database and parts of the annotated corpus. One can retrieve information about frames and frame elements, follow the different frame relations, "realize" frames and frame elements by linking them with text, map them, ...
    Available versions: 0.3.1, 0.3
  • JavaRAP
    JavaRAP is an implementation of the classic Resolution of Anaphora Procedure (RAP) given by Lappin and Leass (1994). It resolves third person pronouns and lexical anaphors, and identifies pleonastic pronouns. The original purpose of the implementation is to provide anaphora resolution results to our TREC 2003 Q&A system.
    Available versions: 1.11
  • Jena
    Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine.
    Available versions: 2.6.2, 2.5.6, 2.5.5, 2.5.4
  • LBJ NER Tagger
    This is a state-of-the-art NER tagger that tags plain text with named entities (people / organizations / locations / miscellaneous).
  • LKB
    The LKB system is a grammar and lexicon development environment for use with unification-based linguistic formalisms. While not restricted to HPSG, the LKB implements the DELPH-IN reference formalism of typed feature structures (jointly with other DELPH-IN software using the same formalism).
  • LT-TTT2
    XML-based software for shallow linguistic processing of text.
  • LibSVM
    LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
  • LingPipe
    LingPipe is a suite of Java libraries for information extraction and data mining.
    Available versions: 3.9.2, 3.7.0, 3.5.1
  • Link Grammar Parser
    The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax.
    Available versions: 4.1b
  • LoPar
    LoPar is an implementation of a parser for head-lexicalised probabilistic context-free grammars.
  • Lucene
    Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
    Available versions: 3.0.1, 2.4.0, 2.3.2
  • MALLET
    MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
    Available versions: 0.4
  • MINIPAR
    MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
  • MIT Java WordNet Interface
    The MIT Java Wordnet Interface (JWI) is an easy-to-use, easy-to-extend Java library for interfacing with Wordnet.
    Available versions: 2.1.4
  • MMAX2
    MMAX2 is a GUI-based text annotation tool for creating and visualizing annotations. It uses a flexible stand-off XML data format, and has advanced and customizable methods for information and relation visualization.
    Available versions: 1.13.002, 1.12
  • MSLR
    Microsoft Learning to Rank
  • MSTParser
    MSTParser is a non-projective dependency parser that searches for maximum spanning trees over directed graphs. Models of dependency structure are based on large-margin discriminative training methods. Projective parsing is also supported.
  • MUC 6
    This corpus contains the annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC 6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).
  • MXPOST
    MXPOST is a Java (JDK 1.1) implementation of the part-of-speech tagger described in: Adwait Ratnaparkhi. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, May 17-18, 1996. University of Pennsylvania.
  • MaltConverter
    MaltConverter is a terminal-based program for conversion between the dependency treebank representation formats Malt-XML, Malt-TAB and TIGER-XML (NTN). It is also possible to map attribute names and tagsets.
  • MaltParser
    MaltParser is a system for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model.
    Available versions: 1.4.1, 1.4, 1.3
  • Manually Annotated Sub-Corpus First Release
  • Mate-SRL
    Semantic role labeling system. See A. Björkelund, L. Hafdell, and P. Nugues. Multilingual semantic role labeling. In Proceedings of The Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 43--48, Boulder, June 4--5 2009.
  • Maximum Entropy Toolkit
    The Maximum Entropy Toolkit provides a set of tools and a library for constructing maximum entropy (maxent) models in either Python or C++.
  • Memory-based Tagger Generator and Tagger
    The tagger-generator part can generate a sequence tagger on the basis of a training set of tagged sequences; the tagger part can tag new sequences.
  • MonaSearch
    MonaSearch is a powerful query tool for linguistic treebanks.
    Available versions: 0.3
  • MorphAdorner
    MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. We use the term "adornment" in preference to terms such as "annotation" or "tagging" which carry too many alternative and confusing meanings. Adornment harkens back to the medieval sense of manuscript adornment or illumination -- attaching pictures and marginal comments to texts.
  • IRST-LM
    The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. Our software has been integrated into a popular open source Statistical Machine Translation decoder called Moses, and is compatible with language models created with other tools, such as the SRILM Toolkit.
    Available versions: 3712-srilm, 3712-irstlm, 2010-08-13, 1.6.0, 1.15.11
  • NEGRA
    10,000 sentences from the German newspaper "Frankfurter Rundschau", annotated with parts of speech and syntactic structures.
  • NLTK Data
    Corpora and other data used by NLTK.
    Available versions: 0.9
  • NXT
    NXT is a set of libraries and tools that provide for the native representation, manipulation, query and analysis of multimedia language data.
  • Named Entity Tagger
    The Named Entity Tagger is a self-contained package which incorporates versions of SNoW and FEX, together with an inference module. It includes a network trained to recognize Person, Location, Organization and Misc. entities in English.
  • Natural Language Toolkit
    NLTK, the Natural Language Toolkit, is a suite of open source Python modules, data and documentation for research and development in natural language processing. Supported tasks include, for example: tagging, chunking, chart parsing, probabilistic parsing, feature-based grammars, logical semantics, linguistic data management and regular expressions.
    Available versions: 2.0.4, 0.9.8, 0.9.5, 0.9
  • NomBank
    NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Our goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
  • North American News Text, Complete
    The NANC is a collection of English news text from the Los Angeles Times, Washington Post, New York Times, Reuters and the Wall Street Journal.
  • OntoNotes
    The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
    Available versions: 3.0, 2.0
  • Open Roget's
    The Open Roget's Project sets out to create a fully functional lexical resource for Natural Language Processing based on Roget's Thesaurus. A Java 5.0 implementation with the 1911 data is now available.
    Available versions: 1.1
  • Open for Questions Corpus
    This corpus consists of 11 files corresponding to the 11 categories of the "Open for Questions" event on whitehouse.gov in March of 2009. In the course of this event, Americans submitted over 100,000 questions which they wanted President Obama to answer. Each file contains close to 1,000 questions from the respective category extracted from the "Open for Questions" page.
  • OpenCCG
    OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism.
  • OpenNLP
    OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.
    Available versions: 1.4.3
  • OpenSubtitles
    This is a collection of movie subtitles in various languages, tokenized and aligned at the sentence level.
    Available versions: 0.7, 0.3
  • PAN Plagiarism Corpus
    This corpus contains documents in which plagiarism has been inserted automatically and manually.
  • PaWs
    PaWs (Parser Wrappers) provides a simple Java-based wrapper for the Minipar command line interface.
  • ParseBanker
    The LFG Parsebanker Interface is a Web-based tool for building LFG treebanks (parsebanks). It includes a discriminant-based mechanism for disambiguation of parses.
  • Penn Discourse Treebank
    The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
  • Penn Discourse Treebank Version 2.0 Update - RTE data
    Recognizing Textual Entailment (RTE) update for the Penn Discourse Treebank 2.0
  • Penn Treebank
    The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. We have also added the dependency-converted version in CoNLL format.
  • Penn2Malt
    Penn2Malt is a terminal-based program for conversion from the representation of the Penn Treebank to the Malt-TAB format.
  • Precompiled Personalized PageRank vectors for all WordNet lemmas
    This is a collection of files which contain the probability vectors for all lemmas in WordNet version 3.0. The vectors have been produced by the ukb_ppv program (http://ixa.si.ehu.es/ukb).
  • Projekt Gutenberg
    Projekt Gutenberg collects texts which are in the public domain. This collection contains pieces by almost 400 different authors. All of them are in German and formatted as HTML.
  • Protégé
    Protégé is a free, open source ontology editor and knowledge-base framework.
    »
    « 3.3.1: index
     
  • PyGoogle
    This module is a wrapper for the Google Web APIs. It allows you to do Google searches, retrieve pages from the Google cache, and ask Google for spelling suggestions.
  • PyLucene
    Python extension for accessing Java Lucene
  • PySVMLight
    A Python binding to the SVM-Light support vector machine library by Thorsten Joachims.
  • RASP
    RASP is a domain-independent, robust parsing system for English.
  • RapidMiner
    RapidMiner is a freely available open-source knowledge discovery environment.
    »
    « 4.3: index
     
  • The Regensburg Parallel Corpus (German - Russian)
    The RPC is a parallel aligned corpus of translated and original belletristic texts in Slavic and some other languages, developed at the Institute of Slavistics at Regensburg University.
  • Reranking Parser
    A reranking parser which uses a regularized MaxEnt reranker to select the best parse from the 50-best parses returned by a generative parsing model.
  • ResearchCyc
    OpenCyc is the open source version of the Cyc Knowledge Base. Included with the release is a free binary version of the Cyc Knowledge Server. The Cyc Knowledge Server includes an inference engine, a knowledge base browser and an API for writing programs in other high-level languages that access and use the OpenCyc knowledge base.
    »
    « 1.0: index
     
  • Reuters Corpus
    A collection of Reuters newswire texts, sorted by months.
  • Reverend
    Reverend is a simple Bayesian classifier. It is designed to be easy to adapt and extend for your application.
  • S-Space
    The S-Space Package is a collection of algorithms for building Semantic Spaces as well as a highly-scalable library for designing new distributional semantics algorithms.
  • SALSA
    The data provided by this SALSA release add a layer of role-semantic information to TIGER (release 1), a syntactically annotated German newspaper corpus.
  • SALTO
    SALTO is a graphical tool that supports manual annotation of text corpora and annotation management.
  • SFST
    Stuttgart Finite State Transducer tools (SFST) is a toolbox for the implementation of morphological analysers and other tools which are based on finite state transducer technology.
    »
    « 1.2: index | 1.1: index
     
  • Morphisto
    Morphisto is a free morphological lexicon provided by IDS Mannheim. It is based on SFST and replaces parts of the SMOR package.
  • SMOR
    SMOR is a German finite-state morphology implemented in the SFST programming language.
  • SMS Corpus
    This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
  • SMULTRON
    SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
    »
    « 1.0: index
     
  • SNoW
    The SNoW (Sparse Network of Winnows) learning architecture is a multi-class classifier that is specifically tailored for large scale learning tasks and for domains in which the potential number of features taking part in decisions is very large, but may be unknown a priori. It learns a sparse network of linear functions in which the target concepts (class labels) are represented as linear functions over a common feature space.
  • SPASS
    SPASS is an automated theorem prover for first-order logic with equality.
    »
    « 3.0: index
     
  • SVDLIBC
    SVDLIBC is a C library based on the SVDPACKC library. SVDLIBC offers a cleaned-up version of the code with a sane library interface and a front-end executable that performs matrix file type conversions, along with computing singular value decompositions. Currently the only SVDPACKC algorithm implemented in SVDLIBC is las2, because it seems to be consistently the fastest. This algorithm has the drawback that the low order singular values may be relatively imprecise, but that is not a problem for most users who only want the higher-order values or who can tolerate some imprecision.
  • SemCor
    The SemCor corpus, created by Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. All words are tagged for PoS, and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet.
    »
    « 2.1: index
     
  • SemEval 2010 Task 10: Linking Events and their Participants in Discourse
    This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
    »
    « 2010: index | 1.0: index
     
  • SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
  • SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
    This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.
  • SemEval-2013 Task 3: Spatial Role Labeling
  • SemLink
    SemLink is a project whose aim is to link together different lexical resources via a set of mappings. These mappings will make it possible to combine the different information provided by these different lexical resources for tasks such as inferencing.
  • Semafor
    SEMAFOR: Semantic Analysis of Frame Representations is a tool for automatic analysis of the frame-semantic structure of English text.
  • Semantic Vectors
    Semantic Vector indexes, created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene. The package was created as part of a project by the University of Pittsburgh Office of Technology Management, to explore the potential for automatically matching related concepts in the technology management domain, e.g., mapping new technologies to potentially interested licensors.
    »
    « 1.10: index
     
  • SenseLearner
    The goal of the SenseLearner project is to conduct exploratory research of various WSD techniques to enable the development of a tool for semantic tagging of all words in unrestricted text.
  • Senseval 3 -- Task 6 (English Lexical Sample)
    The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. By the time Senseval-3 takes place, we expect to have enough data for about 60 ambiguous nouns, adjectives, and verbs.
  • SentiWordNet
    SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity.
    »
    « 1.0.1: index
     
  • Shalmaneser
    Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general: It can handle any role-semantic paradigm (e.g., PropBank roles) and any set of word senses (e.g., WordNet synsets), provided the input data is offered in SalsaTigerXML.
  • Sleepy Student Parser
    'Sleepy' is a simple unlexicalized parser for German, returning both syntactic category and grammatical function labels in the tree. It will not be able to parse some sentences - coverage is only 93% on newspaper text.
  • Stanford CoreNLP
    Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.
  • Stanford POS Tagger
    This software is a Java implementation of the log-linear part-of-speech taggers described in: Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70. Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
    »
    « 2.0: index | 1.6: index
     
  • Stanford Parser
    This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.
    »
    « 1.6.3: index | 1.6: index
     
  • Stanford Named Entity Recognizer
    CRFClassifier is a Java implementation of a Named Entity Recognizer. The software provides an implementation of Conditional Random Field sequence models, of the sort pioneered by Lafferty, McCallum, and Pereira (2001), coupled with well-engineered feature extractors for Named Entity Recognition.
    »
    « 1.1.1: index | 1.1: index
     
  • Stockholm TreeAligner
    The Stockholm TreeAligner allows you to create alignment links between corresponding nodes (or words) in two treebanks in different languages.
  • Strudel
    Distributional representation of English words a la Baroni & Lenci
  • Susanne
    The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
  • Synpathy
    Synpathy is a tool for annotating, analyzing, and graphically editing the syntactic structure of sentences (e.g. linguistically annotated text corpora), developed at the Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands.
  • Szeged Corpus
    Szeged Corpus 2.0, the extension of the first version of the corpus, is a morpho-syntactically analyzed and manually annotated natural language database. It is not only bigger than the first version: apart from the contextually selected morpho-syntactic codes, the database also contains the possible codes, so that it can be used efficiently for testing automatic grammatical category annotation methods. The corpus consists of 1.2 million word entries, which cover 155,500 different word forms, and also contains a further 250 thousand punctuation marks. Corpus files are available in XML format; their inner structure is described by the TEIxLite DTD (Document Type Definition) scheme.
    »
    « 2.0: index
     
  • TANGO
    TANGO is an annotation tool used to annotate TLINKS, ALINKS and SLINKS according to the TimeML specification.
  • TERN
    This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the Automatic Content Extraction (ACE) program.
  • TIGER
    The TIGER Treebank is a corpus of 40,000 syntactically annotated German newspaper sentences. The annotation scheme used is an extended and improved version of the NEGRA annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version which was converted with the pennconverter (default setting; directory dependency-converted), but you will probably want to use the CoNLL06 data.
  • TIGERSearch
    TIGERSearch is a specialized search engine for retrieving information from annotated corpora.
  • Tarsqi Toolkit
    The Tarsqi Toolkit (TTK) is a set of components for extracting temporal information from newswire text. TTK extracts time expressions, events, subordination links and temporal links; in addition, it ensures consistency of temporal information.
  • The New York Times Annotated Corpus
    The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
  • The Tübingen Treebank of Written German
    The TüBa-D/Z treebank is a syntactically annotated, German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz).
    »
    « 5: index | 4: index | 3: index
     
  • Theorist
    A compiler from Theorist into Prolog (it has been tested on both Sicstus and Quintus Prologs), and many example Theorist programs, including most of the standard nonmonotonic (but very monotonous) examples, diagnostic examples, and examples of scene interpretation.
  • TiMBL
    Tilburg Memory Based Learner
    »
    « 6.2.1: index | 6.1.2: index
     
  • TimeBank
    TimeML aims to capture and represent temporal information. This is accomplished using four primary tag types: TIMEX3 for temporal expressions, EVENT for temporal events, SIGNAL for temporal signals, and LINK for representing relationships.
    »
    « 1.1: index
     
  • TinySVM
    TinySVM is an implementation of Support Vector Machines (SVMs) for the problem of pattern recognition. Support Vector Machines are a new generation of learning algorithms based on recent advances in statistical learning theory, and have been applied to a large number of real-world applications, such as text categorization and hand-written character recognition.
  • ToscanaJ
    The ToscanaJ project is a collaboration between DSTC, the University of Queensland and the Technical University of Darmstadt to recreate a classic Formal Concept Analysis tool called Toscana and to give the FCA community a platform to work with.
  • TrEd
    TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
    »
    « 1.4206: index
     
  • TreeTagger
    The TreeTagger is a tool for annotating text with part-of-speech and lemma information which has been developed within the TC project at the Institute for Computational Linguistics of the University of Stuttgart.
  • Twitter data set
    A slightly cleaned-up version of the Twitter data gathered through Twitter's streaming API (http://stream.twitter.com/). The data is released under a Creative Commons license.
  • TypeDM
    Distributional Memory: A general framework for corpus-based semantics
  • UKB
    UKB
    »
    « 0.1.5: index | 0.1.3: index | 0.1.0: index
     
  • UMD Death Penalty Corpus
    The Death Penalty Corpus is a collection of material from Web sites that express views for and against the death penalty.
  • UN Corpora
    The corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that will be presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.
  • Unified Linguistic Annotation Text Collection
    The Unified Linguistic Annotation (ULA) project seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference and opinion annotations. The Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11).
    »
    « 1.0: index
     
  • The Universal Declaration of Human Rights
    The Universal Declaration of Human Rights in over 300 different languages. All declarations are taken from http://www.unhchr.ch/udhr/navigate/alpha.htm. They have been downloaded and converted by the script udhr-get.py.
  • VICO Social Media Forum-Korpus
    100,000 posts each on the topics of health and PC (applications) from various German-language web forums, including meta-information (thread, posting date, ...).
  • VerbNet
    VerbNet (VN) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet, Xtag, and FrameNet.
    »
    « 3.0: index | 2.3: index
     
  • WFSC
    WFSC compiles regular expressions into multi-tape weighted finite-state machines (n-WFSMs) with symbol classes. These machines define regular (also called rational) n-ary relations which assign a weight from some semiring to any n-tuple of strings (0 if the n-tuple is not accepted). Special cases of n-WFSMs are weighted acceptors (n=1) and weighted transducers (n=2).
  • ukWaC
    The UK Web Archive contains websites that publish research, that reflect the diversity of lives, interests and activities throughout the UK, and demonstrate web innovation.
  • WaCTK
    The Web as Corpus Toolkit (WaCTK) is a collection of programs that can be used to create a (large) text corpus from a list of URLs. The corpus can then be used for linguistic purposes or for lexicography. While it is questionable whether you are allowed to distribute a corpus of web pages you are not the copyright holder of, it is much easier to distribute only pointers to all those pages - a list of URLs.
  • WaCkypedia
    A 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser).
  • deWac
    A 1.7 billion word corpus constructed from the Web, limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.
  • frWac
    A 1.6 billion word corpus constructed from the Web, limiting the crawl to the .fr domain and using medium-frequency words from the Le Monde Diplomatique corpus and basic French vocabulary lists as seeds.
  • pukWaC
    The same as ukWaC, a 2 billion word corpus acquired from the .uk domain, but with a further layer of annotation added, i.e. a full dependency parse. The parsing was performed with the MaltParser.
  • Web 1T 5-gram, 10 European Languages
    Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens.
  • Web 1T 5-gram Corpus
    This data set contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. Data collection took place in January 2006. This means that no text that was created on or after February 1, 2006 was used.
  • Weka
    Weka is a collection of machine learning algorithms for data mining tasks.
    »
    « 3.6.1: index
     
  • WikiXML
    WikiXML is a collection of Wikipedia articles converted to XML format.
  • Wikipedia-Similarity
    Wikipedia-Similarity is a Java library for querying Wikipedia and computing relatedness between words and phrases.
  • WordNet
    WordNet(R) is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
    »
    « 2.1: index | 2.0: index | 1.7.1: index | 1.5: index | 1.0: index
     
  • Leipzig Corpora Collection / Wortschatz
    The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed.
  • XFST
    Xerox finite-state tool (XFST) is a general-purpose utility for computing with finite-state networks. It enables the user to create simple automata and transducers from text and binary files, regular expressions and other networks by a variety of operations. The user can display, examine and modify the structure and the content of the networks. The result can be saved as text or binary files.
  • XLE
    XLE consists of algorithms for parsing and generating Lexical Functional Grammars (LFGs) along with a rich graphical user interface for writing and debugging such grammars.
    »
    « 2010-10-06: index | 2010-02-19: index | 2009-09-18: index | 2009-08-12: index | 2009-01-21: index | 2008-10-27: index | 2008-08-28, 64bit: index | 2008-08-28: index | 2008-02-25: index | 2007-04: index | 18. April 2008: index
     
  • XRay
    XRay is a Prolog-technology theorem prover for reasoning from incomplete information; it is based on an approach to query-answering in default logics described in Schaub (1995).
  • YAGO
    YAGO is a huge semantic knowledge base. Currently, YAGO knows more than 2 million entities (like persons, organizations, cities, etc.). It knows 20 million facts about these entities. Unlike many other automatically assembled knowledge bases, YAGO has a manually confirmed accuracy of 95%.
  • Yahoo! Answers Comprehensive Questions and Answers
    Yahoo Webscope Dataset L6
  • Yahoo! Answers Manner Questions
    Yahoo Webscope Dataset L5
  • Yahoo! Answers Question Types
    Yahoo Webscope Dataset L7
  • Yahoo! Answers Search Query Logs for Nine Languages
    Yahoo Webscope Dataset L8
  • Yahoo! Learning to Rank Challenge
    Yahoo Webscope Dataset L14
  • YamCha
    YamCha (Yet Another Multipurpose CHunk Annotator) is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.
  • Die Zeit online
    News articles that appeared on the website of "Die Zeit", a German weekly magazine. The articles (1999-2001) have been retrieved from the website.
  • Cognates
    List of English-German identical cognates with POS tags extracted from BNC (English) / HGC (German)
  • crf++
    CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
  • Dict.cc
    Publicly available dict.cc German-English dictionary
  • gensim
    Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
  • musiXmatch dataset
    The MXM dataset provides lyrics for many MSD tracks. The lyrics come in bag-of-words format: each track is described as the word-counts for a dictionary of the top 5,000 words across the set.
  • sdewac
    A 0.88 billion word corpus derived from deWaC; duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions: a POS-tagged and lemmatized version and a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate).