# MONTY TAGGER - A Brill-Based POS Tagger for Python/Java
#
# Author: Hugo Liu
# Project Page:
#
# Copyright (c) 2002,2003 by Hugo Liu, MIT Media Lab
# Original Brill data (c) Eric Brill, UPenn, M.I.T.
#
# Use is granted under the GNU General Public License (GPL):
#
#
# About MontyTagger:
#   - tokenizes and tags English texts
#   - uses the Penn Treebank tagset
#   - basic tagging based on Brill'94
#   - uses Brill94-compatible lexicon and rule files
#     (LEXICON, LEXICALRULEFILE, CONTEXTUALRULEFILE), included
#   - basic tagging at 500 words/sec in python as of v1.2
#     (200 words/sec in v1.0)
#   - basic tagging has 96% word-level accuracy
#     on English non-fiction (same as Brill94)
#   - written in python, full cross-platform compatibility
#   - also available as a Java .jar file
#
# Suggestions for Use and API:
#   - running "python MontyTagger.py" from the command line
#     will bring up the interactive interpreter
#   - type "python MontyTagger.py /?" for command line usage
#   - Python API:
#       - tag(text,expand_contractions_p,all_pos_p)
#           - use this to tokenize & tag text
#           - returns text in word/TAG format (e.g. word/NN)
#           - expand_contractions_p = 0 or 1; changes
#             contraction handling in the tokenizer
#           - all_pos_p = 0 or 1; if set to 1, will
#             display all plausible tags for each word
#             as word/TAG1/TAG2
#       - tag_tokenized(text,all_pos_p)
#           - use this to tag already tokenized text
#
#
# New in Version 1.2:
#   - lexicon reimplemented; additional optimizations
#   - 2.5x faster tagging
#       - python: (v1.0: 200 words/s, v1.2: 500 words/s)
#       - java:   (v1.0: 80 words/s,  v1.2: 200 words/s)
#
#   - 1.6x-4x lower memory usage
#       - python: (v1.0: 20 MB, v1.2: 5 MB)
#       - java:   (v1.0: 40 MB, v1.2: 25 MB)
#
#   - 4.4x-10x faster tagger loading
#       - python: (v1.0: 10 secs, v1.2: 1 sec)
#       - java:   (v1.0: 22 secs, v1.2: 5 secs)
#
# New in Version 1.0:
#   - python version tested and benchmarked
#   - currently TBL training is not implemented
#
# --please send bugs & suggestions to hugo@media.mit.edu--
#
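The header above says that tag() returns text in word/NN format, and word/TAG1/TAG2 when all_pos_p is set to 1. The following is a minimal sketch of how a caller might split such output into words and tag lists. It is not part of MontyTagger: the helper name `parse_tagged` is hypothetical, the sample strings are written by hand rather than produced by the tagger, and the code is written for Python 3. It assumes tokens are separated by whitespace and that a word never contains a "/" itself.

```python
def parse_tagged(tagged_text):
    """Split 'word/TAG' or 'word/TAG1/TAG2' tokens into (word, [tags]) pairs.

    Hypothetical helper for illustration; not part of the MontyTagger API.
    """
    pairs = []
    for token in tagged_text.split():
        # Assumption: the word itself contains no "/" character, so the
        # first field is the word and every remaining field is a tag.
        word, *tags = token.split("/")
        pairs.append((word, tags))
    return pairs


# Hand-written sample strings in the documented format (not real tagger output).
single_tag_sample = "The/DT cat/NN sat/VBD"
all_pos_sample = "run/NN/VB"

print(parse_tagged(single_tag_sample))
print(parse_tagged(all_pos_sample))
```

Splitting on the first "/" keeps the sketch short, but a token whose word contains a slash (a fraction, for instance) would be misread, so real output may need a stricter rule.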