The Phrasehunter

Searching and evaluating words and contexts in static text corpora for lexicographical and linguistic research.

written by Torsten Marek and Armin Schmidt

This site is for archival purposes only. For more recent versions, see http://diotavelli.net/phrasehunter/


Features

Download

Source code for Linux systems: phrasehunter0_5.tar.gz (8.6 MB, includes a small test corpus)
Source code documentation: ph-src-doc0_5.tar.gz (889 KB)

Please note: The Phrasehunter is still under development. This is version 0.5.

Installation

  1. Before compilation, make sure you have the following installed:
  2. Download and unpack the source code, then cd into the unpacked phrasehunter directory.
  3. Call scons with the debug=no option:
    /path/to/sourcecode/phrasehunter$ scons debug=no
  4. Consider adapting $PATH to include /path/to/sourcecode/phrasehunter/phgui/, /path/to/sourcecode/phrasehunter/ph-admin/ and /path/to/sourcecode/phrasehunter/ph-indexer/
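
The $PATH adjustment from step 4 could look roughly like the following sketch. It assumes a POSIX-compatible shell; PH_ROOT is a hypothetical variable introduced here for readability, and /path/to/sourcecode stands for wherever you unpacked the source.

```shell
# Hypothetical variable: replace the value with your own unpack location.
PH_ROOT=/path/to/sourcecode/phrasehunter

# Prepend the three tool directories named in step 4 to PATH
# for the current shell session.
export PATH="$PH_ROOT/phgui:$PH_ROOT/ph-admin:$PH_ROOT/ph-indexer:$PATH"
```

This only affects the current session. To keep the setting, the same lines could be placed in a shell startup file such as ~/.bashrc (assuming bash is your login shell).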

Usage

Indexing a corpus

  1. Before indexing, you need to create and initialize the corpus database. The program ph-admin does all of that for you automatically. Simply call:
    ~$ ph-admin create path/corpus-name
    where corpus-name is the name you want your corpus directory to have. If you haven't adapted $PATH as recommended under Installation, you need to provide the full path to phrasehunter/ph-admin/ph-admin.
  2. Your corpus should consist of small UTF-8-encoded text files without any HTML or XML markup.
  3. Now you're ready to index:
    ~$ ph-indexer corpus-directory textfile
    where corpus-directory is the directory you specified in step 1 and textfile is the file to be indexed. Most of the time you will probably want to index several files at once. Do that by using wildcards, for example:
    ~$ ph-indexer corpus-directory textfiles/*
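
Since step 2 asks for UTF-8-encoded input, it may help to check the encoding of each file before indexing. The following is a minimal sketch, not part of The Phrasehunter: is_utf8 and index_checked are hypothetical helper names, and the check relies on the standard iconv utility failing on byte sequences that are not valid UTF-8. It does not detect HTML or XML markup. ph-indexer is called once per file, in the form shown in step 3.

```shell
# Succeeds if the file decodes as valid UTF-8.
# iconv exits with a non-zero status on invalid input.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: index every regular file in a directory that
# passes the UTF-8 check and report the ones that do not.
#   $1 = corpus directory created with ph-admin
#   $2 = directory containing the text files
index_checked() {
    corpus_dir=$1
    text_dir=$2
    for f in "$text_dir"/*; do
        [ -f "$f" ] || continue
        if is_utf8 "$f"; then
            ph-indexer "$corpus_dir" "$f"
        else
            echo "skipping (not valid UTF-8): $f" >&2
        fi
    done
}
```

With these functions defined in the current shell, a call such as index_checked path/corpus-name textfiles would index the valid files and list the skipped ones on standard error.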

The graphical interface

Administering corpora

The tool ph-admin can do much more than just setting up the database, of course. For example, it helps you maintain corpora by providing options to remove single files from the corpus and the index. (More coming soon ...)

Development

The Phrasehunter is still under development. If you find a bug, have questions or suggestions, or would like to help, feel free to contact us: armin.sch@gmail.com. Or get right down to work and start browsing the source documentation.

Current Bugs and Issues

The GUI class design was recently switched to Qt's Model/View architecture, which was introduced in Qt 4. This proved very useful: the GUI is now a lot faster than before, and the modules are cleanly factored for better maintainability. There are a couple of issues that still need to be resolved, though: