Welcome to the Department of Computational Linguistics' resources home page!
These pages document all the resources available at our department.
If you are a member, please log in with the general coli account (top right) to access all the available information.
If you have a technical problem, please look for an existing ticket concerning it or open a new ticket.
If you have general questions, contact us at resources@cl.uni-heidelberg.de.
Resources
-
(Baked) Strudel
-
2005 NIST Speaker Recognition Evaluation Training Data
2005 NIST Speaker Recognition Evaluation Training Data consists of 392 hours of conversational telephone speech in English, Arabic, Mandarin Chinese, Russian and Spanish and associated English transcripts used as training data in the NIST-sponsored 2005 Speaker Recognition Evaluation (SRE).
-
2006 NIST Spoken Term Detection Development Set
-
2008 CoNLL Shared Task Data
The 2008 CoNLL Shared Task Data contains the trial corpus, training corpus, development and test data for the 2008 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2008 Shared Task developed syntactic dependency annotations, including information such as named-entity boundaries, and semantic dependencies that model the roles of both verbal and nominal predicates. The materials in the Shared Task data consist of excerpts from the following corpora: Treebank-3 LDC99T42, BBN Pronoun Coreference and Entity Type Corpus LDC2005T33, Proposition Bank I LDC2004T14 (PropBank) and NomBank v 1.0 LDC2008T23.
-
ACE 2005 English SpatialML Annotations
-
ACE-2
The objective of the ACE program is to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from ASR and OCR). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data.
-
ACL Anthology Reference Corpus
-
AOL-DATA
-
AQUAINT-2
AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute of Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).
-
ASSERT
ASSERT is an automatic statistical semantic role tagger that can annotate naturally occurring text with semantic arguments. When presented with a sentence, it performs a full syntactic analysis of the sentence, automatically identifies all the verb predicates in that sentence, extracts features for all constituents in the parse tree relative to the predicate, and identifies and tags the constituents with the appropriate semantic arguments.
-
Alchemy
-
Amazon Multi-Domain Sentiment Dataset
-
American National Corpus
-
BART
-
British Academic Written English Corpus
The BAWE corpus contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are fairly evenly distributed across four broad disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and taught masters level). Thirty-five disciplines are represented. The assignments have been annotated using a system devised in accordance with the TEI guidelines.
-
BBN Pronoun Coreference and Entity Type Corpus
-
BLLIP NANC Treebank
The BLLIP NANC corpus contains a Penn Treebank-style parsing of approximately 24 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).
-
Banjo
-
Berkeley Aligner
-
Berkeley Parser
-
Bohnet
-
Brill Tagger
-
British National Corpus
-
Buckwalter Arabic Morphological Analyzer
-
C&C
-
CCGBank
-
CDG
-
CELEX2
-
CMU SLM Toolkit
-
CORPS
-
CQP
-
Chinese Proposition Bank
-
Chinese Treebank
-
Cluto
-
CoNLL 2011 ST data set
-
CoNLL 2012 ST data set
-
CoNLL NER
-
CoNLL SRL
-
Collins Parser
-
Concept Explorer
-
ConceptNet
-
CorScorer
-
CrowdFlower
CrowdFlower is a web interface for designing crowdsourcing tasks. Annotations/judgments can either be ordered from Amazon Mechanical Turk or the like, or you can recruit the contributors yourself by providing them with a link to the task. Contact resources@cl... if you want to use our ICL account.
-
Distributional Memory
-
Dan Bikel's Parser
The software is an extensible, parallel parsing engine that accommodates many different types of generative, statistical parsing models (including an emulation of Mike Collins's parsing model with equally good performance), and can easily be extended to new domains and new languages.
-
Datasets for Generic Relation Extraction (reACE)
-
Dependency-parsed British National Corpus
-
ECLiPSe
-
Enron News Corpus
-
Europarl
-
Europarl
-
Extended WordNet
-
Extracting syntactically constrained paraphrases
-
FrameNet
The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
-
Freebase
Freebase is an open, shared database that contains structured information on millions of topics in hundreds of categories. This information is compiled from open datasets like Wikipedia, MusicBrainz, the Securities and Exchange Commission, and the CIA World Fact Book, as well as contributions from our user community.
-
GADeL
-
GALE Phase 1 Arabic Blog Parallel Text
Blogs are posts to informal web-based journals of varying topical content. GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and consists of 102K words (222 files) of Arabic blog text and its English translation from thirty-three sources. This release was used as training data in Phase 1 of the DARPA-funded GALE program. (LDC2008T02)
-
GATE
-
GIZA++
-
GWSD
GWSD is a system for Unsupervised Graph-based All-Words Word Sense Disambiguation. Please refer to (Sinha and Mihalcea, 2007) for a description of the graph-based disambiguation method, as well as for brief descriptions of all the similarity measures and the graph-centrality algorithms used by GWSD.
-
Geobase
-
GermaNet
-
GermaNet API in Java
-
GermaNet Perl API
-
German Topological Parser
-
GibbsLDA++
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large-scale datasets, including large collections of text/Web documents. Note: This one does not run on ella.
-
English Gigaword 5th Edition
-
HCRC Map Task Corpus
-
HILDA
-
HTML Parser
-
Hadoop
-
Heart of Gold
-
Heise-Newsticker Meldungen
-
IBM LanguageWare Resource Workbench
IBM® LanguageWare® is a set of run-time libraries and an easy-to-use Eclipse-based development environment for building custom text analyzers in various languages. The tools make it easy to build dictionaries, ontologies, and rules for identifying key information, relationships and meaning.
-
IceNLP
-
Indri
-
JGraphT
-
JNET
-
JRC-Acquis
-
JSBD
-
JULIE Token Boundary Detector (JTBD)
The JULIE Lab Sentence Boundary Detector (JSBD) and the JULIE Lab Token Boundary Detector (JTBD) are machine learning-based tools, developed and optimized for handling life science documents containing many tricky cases which many other tools, especially rule-based ones, do not handle appropriately.
-
JWN-Similarity
-
JWPL
-
Java FrameNet API
The FrameNet API can be used to access the FrameNet database and parts of the annotated corpus. One can retrieve information about frames and frame elements, follow the different frame relations, "realize" frames and frame elements by linking them with text, map them, ...
-
JavaRAP
JavaRAP is an implementation of the classic Resolution of Anaphora Procedure (RAP) given by Lappin and Leass (1994). It resolves third person pronouns and lexical anaphors, and identifies pleonastic pronouns. The original purpose of the implementation is to provide anaphora resolution results to our TREC 2003 Q&A system.
-
Jena
-
LBJ NER Tagger
-
LKB
The LKB system is a grammar and lexicon development environment for use with unification-based linguistic formalisms. While not restricted to HPSG, the LKB implements the DELPH-IN reference formalism of typed feature structures (jointly with other DELPH-IN software using the same formalism).
-
LT-TTT2
-
LibSVM
-
LingPipe
-
Link Grammar Parser
-
LoPar
-
Lucene
-
MALLET
-
MINIPAR
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
-
MIT Java WordNet Interface
-
MMAX2
-
MSLR
-
MSTParser
-
MUC 6
This corpus contains the annotated Wall Street Journal articles, the scoring software and the corresponding documentation used in the MUC 6 evaluation. Both the MUC 6 Additional News Text and the MUC 6 corpus are necessary in order to replicate the evaluation. All the materials are published as received from the corpus creators, without any quality control being done at the LDC (the only difference is that the files have been uncompressed).
-
MXPOST
-
MaltConverter
-
MaltParser
-
Manually Annotated Sub-Corpus First Release
-
Mate-SRL
-
Maximum Entropy Toolkit
-
Memory-based Tagger Generator and Tagger
-
MonaSearch
-
MorphAdorner
MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. We use the term "adornment" in preference to terms such as "annotation" or "tagging" which carry too many alternative and confusing meanings. Adornment harkens back to the medieval sense of manuscript adornment or illumination -- attaching pictures and marginal comments to texts.
-
IRST-LM
The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. Our software has been integrated into a popular open source Statistical Machine Translation decoder called Moses, and is compatible with language models created with other tools, such as the SRILM Toolkit.
-
NEGRA
-
NLTK Data
-
NXT
-
Named Entity Tagger
-
Natural Language Toolkit
NLTK, the Natural Language Toolkit, is a suite of open source Python modules, data and documentation for research and development in natural language processing. Supported tasks include, for example: tagging, chunking, chart parsing, probabilistic parsing, feature-based grammars, logical semantics, linguistic data management and regular expressions.
-
NomBank
NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. Our goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs.
-
North American News Text, Complete
-
OntoNotes
The goal of the OntoNotes project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
-
Open Roget's
-
Open for Questions Corpus
This corpus consists of 11 files corresponding to the 11 categories of the "Open for Questions" event on whitehouse.gov in March of 2009. In the course of this event, Americans submitted over 100,000 questions which they wanted President Obama to answer. Each file contains close to 1,000 questions from the respective category extracted from the "Open for Questions" page.
-
OpenCCG
-
OpenNLP
-
OpenSubtitles
-
PAN Plagiarism Corpus
-
PaWs
-
ParseBanker
-
Penn Discourse Treebank
The Penn Discourse Treebank (PDTB) is an NSF funded project at the University of Pennsylvania. The goal of the project is to annotate the 1 million word Wall Street Journal corpus in Treebank-2 (LDC95T7) with discourse relations holding between the eventualities and propositions mentioned in text, which serve as the arguments to the relation.
-
Penn Discourse Treebank Version 2.0 Update - RTE data
-
Penn Treebank
-
Penn2Malt
-
Precompiled Personalized PageRank vectors for all WordNet lemmas
-
Projekt Gutenberg
-
Protégé
-
PyGoogle
-
PyLucene
-
PySVMLight
-
RASP
-
RapidMiner
-
The Regensburg Parallel Corpus (German - Russian)
-
Reranking Parser
-
ResearchCyc
OpenCyc is the open source version of the Cyc Knowledge Base. Included with the release is a free binary version of the Cyc Knowledge Server. The Cyc Knowledge Server includes an inference engine, a knowledge base browser and an API for writing programs in other high-level languages that access and use the OpenCyc knowledge base.
-
Reuters Corpus
-
Reverend
-
S-Space
-
SALSA
-
SALTO
-
SFST
-
Morphisto
-
SMOR
-
SMS Corpus
This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. Currently (April 2004), the corpus consists of about 10,000 SMS messages collected by students. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.
-
SMULTRON
SMULTRON (Stockholm MULtilingual TReebank) is a parallel treebank developed by the Computational Linguistics Group at the Department of Linguistics, at Stockholm University. The parallel treebank contains around 1000 sentences in English, German and Swedish. The sentences have been PoS-tagged and annotated with phrase structure trees. The trees have been aligned on sentence, phrase and word level. Additionally, the German and Swedish monolingual treebanks contain lemma information.
-
SNoW
The SNoW (Sparse Network of Winnows) learning architecture is a multi-class classifier that is specifically tailored for large scale learning tasks and for domains in which the potential number of features taking part in decisions is very large, but may be unknown a priori. It learns a sparse network of linear functions in which the target concepts (class labels) are represented as linear functions over a common feature space.
-
SPASS
-
SVDLIBC
SVDLIBC is a C library based on the SVDPACKC library. SVDLIBC offers a cleaned-up version of the code with a sane library interface and a front-end executable that performs matrix file type conversions, along with computing singular value decompositions. Currently the only SVDPACKC algorithm implemented in SVDLIBC is las2, because it seems to be consistently the fastest. This algorithm has the drawback that the low order singular values may be relatively imprecise, but that is not a problem for most users who only want the higher-order values or who can tolerate some imprecision.
-
SemCor
-
SemEval 2010 Task 10: Linking Events and their Participants in Discourse
This is the trial, training and testing data from task 10 of SemEval 2010. The training set for both tasks will be annotated with gold standard semantic argument structure and linking information for null instantiations. We annotate the semantic argument structures both in FrameNet and PropBank style.
-
SemEval-2007 Task 17: English Lexical Sample, SRL and All Words
-
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
-
SemEval-2013 Task 3: Spatial Role Labeling
-
SemLink
-
Semafor
-
Semantic Vectors
The package provides Semantic Vector indexes, created by applying a Random Projection algorithm to term-document matrices created using Apache Lucene. The package was created as part of a project by the University of Pittsburgh Office of Technology Management, to explore the potential for automatically matching related concepts in the technology management domain, e.g., mapping new technologies to potentially interested licensors.
-
SenseLearner
-
Senseval 3 -- Task 6 (English Lexical Sample)
-
SentiWordNet
-
Shalmaneser
Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general: It can handle any role-semantic paradigm (e.g., PropBank roles) and any set of word senses (e.g., WordNet synsets), provided the input data is offered in SalsaTigerXML.
-
Sleepy Student Parser
-
Stanford CoreNLP
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words and their parts of speech, identify whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.
-
Stanford POS Tagger
This software is a Java implementation of the log-linear part-of-speech taggers described in:
Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
-
Stanford Parser
This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.
-
Stanford Named Entity Recognizer
CRFClassifier is a Java implementation of a Named Entity Recognizer. The software provides an implementation of Conditional Random Field sequence models, of the sort pioneered by Lafferty, McCallum, and Pereira (2001), coupled with well-engineered feature extractors for Named Entity Recognition.
-
Stockholm TreeAligner
-
Strudel
-
Susanne
The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis.
-
Synpathy
-
Szeged Corpus
Szeged Corpus 2.0, the extension of the first version of the corpus, is a morpho-syntactically analyzed and manually annotated natural language database. It is not only bigger than the first version: apart from contextually selected morpho-syntactic codes, the database also contains the possible codes, so that it can be used efficiently for testing automatic methods of grammatical category annotation. The corpus consists of 1.2 million word entries, which cover 155,500 different word forms, and also contains a further 250 thousand punctuation marks. Corpus files are available in XML format; their inner structure is described by the TEIxLite DTD (Document Type Definition) scheme.
-
TANGO
-
TERN
-
TIGER
The TIGER Treebank is a corpus of 40,000 syntactically annotated German newspaper sentences. The annotation scheme used is an extended and improved version of the NEGRA annotation scheme. The conll06-train+test directory contains the dependency-converted corpus used in the CoNLL 2006 Shared Task. We have also added a dependency version which was converted with the pennconverter (default setting; directory dependency-converted), but you will probably want to use the CoNLL06 data.
-
TIGERSearch
-
Tarsqi Toolkit
-
The New York Times Annotated Corpus
The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
-
The Tübingen Treebank of Written German
-
Theorist
-
TiMBL
-
TimeBank
-
TinySVM
TinySVM is an implementation of Support Vector Machines (SVMs) for the problem of pattern recognition. Support Vector Machines are a new generation of learning algorithms based on recent advances in statistical learning theory, and have been applied to a large number of real-world applications, such as text categorization and hand-written character recognition.
-
ToscanaJ
-
TrEd
TrEd is a fully customizable and programmable graphical editor and viewer for tree-like structures. Among other projects, it was used as the main annotation tool for syntactical and tectogrammatical annotations in The Prague Dependency Treebank, as well as for decision-tree based morphological annotation of The Prague Arabic Dependency Treebank.
-
TreeTagger
-
Twitter data set
-
TypeDM
-
UKB
-
UMD Death Penalty Corpus
-
UN Corpora
-
Unified Linguistic Annotation Text Collection
The Unified Linguistic Annotation (ULA) project seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference and opinion annotations. The Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX EntityTranslation Training/DevTest (LDC2009T11).
-
The Universal Declaration of Human Rights
-
VICO Social Media Forum-Korpus
-
VerbNet
-
WFSC
WFSC compiles regular expressions into multi-tape weighted finite-state machines (n-WFSMs) with symbol classes. These machines define regular (also called rational) n-ary relations which assign a weight from some semiring to any n-tuple of strings (0 if the n-tuple is not accepted). Special cases of n-WFSMs are weighted acceptors (n=1) and weighted transducers (n=2).
-
ukWaC
-
WaCTK
The Web as Corpus Toolkit (WaCTK) is a collection of programs that can be used to create a (large) text corpus from a list of URLs. The corpus can then be used for linguistic purposes or for lexicography. While it is questionable whether you are allowed to distribute a corpus of web pages you are not the copyright holder of, it is much easier to distribute only pointers to all those pages, i.e. a list of URLs.
-
WaCkypedia
-
deWac
-
frWac
-
pukWaC
-
Web 1T 5-gram, 10 European Languages
Web 1T 5-gram, 10 European Languages Version 1 was created by Google, Inc. It consists of word n-grams and their observed frequency counts for ten European languages: Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish and Swedish. The length of the n-grams ranges from unigrams (single words) to five-grams. The n-gram counts were generated from approximately one hundred billion word tokens of text for each language, or approximately one trillion total tokens.
-
Web 1t 5-gram Corpus
-
Weka
-
WikiXML
-
Wikipedia-Similarity
-
WordNet
WordNet(R) is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
-
Leipzig Corpora Collection / Wortschatz
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material have been removed.
-
XFST
Xerox finite-state tool (XFST) is a general-purpose utility for computing with finite-state networks. It enables the user to create simple automata and transducers from text and binary files, regular expressions and other networks by a variety of operations. The user can display, examine and modify the structure and the content of the networks. The result can be saved as text or binary files.
-
XLE
XLE consists of algorithms for parsing and generating Lexical Functional Grammars (LFGs) along with a rich graphical user interface for writing and debugging such grammars.
-
XRay
-
YAGO
YAGO is a huge semantic knowledge base. Currently, YAGO knows more than 2 million entities (like persons, organizations, cities, etc.). It knows 20 million facts about these entities. Unlike many other automatically assembled knowledge bases, YAGO has a manually confirmed accuracy of 95%.
-
Yahoo! Answers Comprehensive Questions and Answers
-
Yahoo! Answers Manner Questions
-
Yahoo! Answers Question Types
-
Yahoo! Answers Search Query Logs for Nine Languages
-
Yahoo! Learning to Rank Challenge
-
YamCha
YamCha (Yet Another Multipurpose CHunk Annotator) is a generic, customizable, and open source text chunker oriented toward a range of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.
-
Die Zeit online
-
Cognates
-
crf++
-
Dict.cc
-
gensim
-
musiXmatch dataset
-
sdewac
A 0.88 billion word corpus derived from deWaC; duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged / lemmatized version and a one-sentence-per-line format, each supplemented with metadata (e.g. parse error rate).
