Resources

  • sdewac
    A 0.88 billion word corpus derived from deWaC, from which duplicate sentences and some noise have been removed. The corpus has been converted to Unicode. SdeWaC comes in two versions, a POS-tagged and lemmatized version and a one-sentence-per-line version, each supplemented with metadata (e.g. parse error rate).