Python: module corpusInterface

corpusInterface

/home/max/LRApy/corpusInterface.py

Project: LRApy
Author: Max Jakob (max.jakob@web.de)
Module: corpusInterface
Module Description: This Module provides an interface for word context search in a prepared corpus. To prepare a corpus see the indexCorpus module.
Version: 1.1
Last change: 2007-02-07

Copyright 2007 by Max Jakob.
This code is released under the GNU GPL. See the accompanying LICENSE file.

Embedded documentation can be translated with the Python pydoc module.

Modules

os
re

Classes



CorpusInterface

class CorpusInterface

    This class is an interface to a corpus. It expects that indexCorpus.py has been run beforehand, so that two files exist in the corpus root directory: one file that maps all complete file paths to indices ('files.list'), and one file that lists all occurrences of all words in the files together with their positions ('words.index'). A corpus directory must be specified when instantiating the class. The method getWordContextes provides context search functionality, with or without stemming. Stemming here simply means that suffixes are ignored. The method getLesserWorkOrder estimates for which of two words it is more efficient to search the corpus and retrieve the contexts.

Methods defined here:

__init__(self, directory, files='files.list', words='words.index')

getLesserWorkOrder(self, word1, word2)
Returns a tuple that puts first the word with fewer entries in the word index file, for which look-up work in the corpus is therefore smaller. This ordering is expected to be more efficient, but the claim has not been sufficiently tested.
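
The ordering rule described above can be illustrated with a minimal sketch. It is not the module's implementation: the function name lesser_work_order and the entry_counts dictionary are hypothetical stand-ins for whatever the class reads from 'words.index', and the tie-breaking behaviour (keep the given order) is an assumption.

```python
def lesser_work_order(entry_counts, word1, word2):
    """Return the two words as a tuple, cheaper look-up first.

    entry_counts is a hypothetical mapping from word to its number of
    index entries; it stands in for data read from 'words.index'.
    """
    count1 = entry_counts.get(word1, 0)
    count2 = entry_counts.get(word2, 0)
    # Assumption: on a tie, the original argument order is kept.
    if count2 < count1:
        return (word2, word1)
    return (word1, word2)


counts = {"are": 1200, "corpus": 35}
print(lesser_work_order(counts, "are", "corpus"))  # ('corpus', 'are')
```

Searching for the rarer word first means fewer positions have to be looked up in the corpus files, which is the expected (but, per the author, untested) source of the efficiency gain.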

getWordContextes(self, word, scope, doStemming=False)
Returns a list of word tuples of length <scope>+1. Every word tuple contains <word> with <scope>-1 words left and right of its occurrences in the corpus.

  Example (word="are", scope=2):
  [
    ("simply", "are", "very"),
    ("you", "are", "nice"),
    ("are", "these"),
    ("they", "are")
  ]

  In the last two elements, "are" is the first and the last word of the file respectively.

If <doStemming> is True, suffixes are ignored when searching for <word>.
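
The windowing behaviour in the example can be reproduced with a small sketch that works on an in-memory token list instead of the index files. Everything here is illustrative: word_contexts is a hypothetical helper, one token list stands for one file, and prefix matching is only an assumed reading of "suffixes are ignored".

```python
def word_contexts(tokens, word, scope, do_stemming=False):
    """Collect context tuples around each occurrence of word.

    tokens stands for the words of a single file. Each tuple holds the
    match plus up to scope-1 neighbours on either side, so tuples at
    the start or end of the file come out shorter.
    """
    width = scope - 1
    contexts = []
    for i, token in enumerate(tokens):
        if do_stemming:
            # Assumption: ignoring suffixes is modelled as a prefix match.
            matches = token.startswith(word)
        else:
            matches = token == word
        if matches:
            start = max(0, i - width)
            contexts.append(tuple(tokens[start:i + width + 1]))
    return contexts


tokens = ["are", "these", "simply", "are", "very", "they", "are"]
for context in word_contexts(tokens, "are", 2):
    print(context)
```

With scope=2 the sketch yields ("are", "these") for the occurrence at the start of the list and ("they", "are") for the one at the end, matching the shortened tuples in the documented example.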

Functions

Looker(...)
Creates a new looker instance.

Data

MAX_WORD_LENGTH = 15
