- CorpusInterface
class CorpusInterface
This class is an interface to a corpus. It expects that indexCorpus.py
has been run beforehand, so that two files exist in the corpus root
directory: 'files.list', which maps every full file path to an index,
and 'words.index', which records every occurrence of every word in the
files together with its position.
A corpus directory must be specified when instantiating the class.
The method getWordContextes provides context search, with or without
stemming. Stemming here simply means that suffixes are ignored.
The method getLesserWorkOrder estimates for which of two words it is
cheaper to search the corpus and retrieve the contexts.
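The ordering idea behind getLesserWorkOrder can be illustrated with a minimal, standalone sketch. It is not the class's actual implementation: the in-memory `index` dictionary is a hypothetical stand-in for 'words.index', the function name `lesser_work_order` is invented for illustration, and the tie-breaking rule is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for 'words.index':
# word -> list of (file index, position) occurrences.
Index = Dict[str, List[Tuple[int, int]]]


def lesser_work_order(index: Index, word1: str, word2: str) -> Tuple[str, str]:
    """Return both words, with the one that has fewer index entries first."""
    count1 = len(index.get(word1, []))
    count2 = len(index.get(word2, []))
    # Assumption: on a tie the original order is kept.
    return (word1, word2) if count1 <= count2 else (word2, word1)


index = {
    "are": [(0, 1), (0, 5), (1, 2)],
    "nice": [(0, 6)],
}
print(lesser_work_order(index, "are", "nice"))
```

Here "nice" comes first because it has only one entry, so fewer corpus positions would have to be visited when collecting its contexts.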
Methods defined here:
- __init__(self, directory, files='files.list', words='words.index')
- getLesserWorkOrder(self, word1, word2)
- Returns a tuple in which the first word is the one with fewer
entries in the word index file, and for which look-up work in the
corpus is therefore lower. This ordering is expected to be more
efficient, but the claim has not been tested sufficiently.
- getWordContextes(self, word, scope, doStemming=False)
- Returns a list of word tuples of length up to <scope>+1. Every word
tuple contains <word> with <scope>-1 words to the left and right of
each of its occurrences in the corpus.
Example (word="are", scope=2):
[
    ("simply", "are", "very"),
    ("you", "are", "nice"),
    ("are", "these"),
    ("they", "are")
]
In the last two elements, "are" is the first and the last word
of its file, respectively, so the tuple is shorter.
If <doStemming> is True, suffixes are ignored when searching
for <word>.
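The windowing behaviour described for getWordContextes can be sketched on a single tokenized file. This is a standalone illustration under stated assumptions, not the class's real code: the function `word_contexts` and its `tokens` argument are hypothetical, the window follows the "<scope>-1 words on each side" reading that matches the example above, and stemming is approximated by a plain prefix match.

```python
from typing import List, Tuple


def word_contexts(tokens: List[str], word: str, scope: int,
                  do_stemming: bool = False) -> List[Tuple[str, ...]]:
    """Collect each occurrence of `word` with scope-1 neighbours per side."""
    side = scope - 1
    contexts = []
    for i, token in enumerate(tokens):
        # Assumption: "ignoring suffixes" is modelled as a prefix match.
        hit = token.startswith(word) if do_stemming else token == word
        if hit:
            # Clamp at the start of the file; slicing clamps at the end.
            start = max(0, i - side)
            contexts.append(tuple(tokens[start:i + side + 1]))
    return contexts


print(word_contexts(["are", "these", "they", "are"], "are", 2))
print(word_contexts(["he", "walked", "home"], "walk", 2, do_stemming=True))
```

Because the function works on one file's tokens at a time, windows never cross file boundaries, which is why occurrences at the start or end of a file yield shorter tuples.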