Studienprojekt SumIt!


Table of Contents

Abstract
Contents of this package

Abstract

As the amount of available information keeps growing, automatic text summarization is becoming a necessity. The most popular approach at present is to grade the sentences of a text, delete the least "valuable" ones, and output the highest-scoring sentences as the final summary. The most prominent examples are the summarization utility in MS Word and Nadav Rotem's Open Text Summarizer. From the latter we know that sentences are graded primarily by counting how often each word stem occurs in the text: the more high-frequency stems a sentence contains, the more valuable it is.
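The stem-frequency grading described above can be sketched roughly as follows. This is a minimal illustration, not the Open Text Summarizer's actual code: the stop-word list, the suffix-stripping `crude_stem` helper, and the `summarize` function are all hypothetical names and simplifications introduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word and suffix lists, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "on", "it"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(sentence):
    """Return the stems of all non-stop-words in a sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def summarize(text, keep=1):
    """Keep the `keep` highest-scoring sentences, in their original order."""
    # Naive sentence split: '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each stem occurs in the whole text.
    frequency = Counter(stem for s in sentences for stem in stems(s))
    # A sentence's score is the sum of the text-wide frequencies of its stems.
    scores = [sum(frequency[stem] for stem in stems(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Note that summing raw frequencies favors long sentences; a real summarizer would probably normalize the score, which this sketch leaves out for brevity.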

This student project (Studienprojekt) aimed to create a Python-based implementation of the Open Text Summarizer, with the difference that lexical chains depicting the referential structure of the text are used for sentence grading. Referential identity between whole noun phrases was also meant to be taken into account. In the end, we wanted to find out whether the extra effort of using a lexical-semantic knowledge base and integrating linguistic hacks can make a difference.

The idea originated in Prof. Hellwig's advanced seminar (Hauptseminar) "Maschinelle Textzusammenfassung" (automatic text summarization, summer semester 2004), where we came up with an algorithm that builds lexical chains based on the text's referential progression. This algorithm was to be integrated into this project. By considering the referential structure of the text, we wanted to take a different direction from that of our predecessors, such as Barzilay and Elhadad (1997) and Brun et al. (2001).
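To make the notion of grading sentences by lexical chains more concrete, here is a generic greedy sketch. It is not the algorithm developed in the seminar, which this text does not spell out: the `RELATED` table is a made-up stand-in for a lexical-semantic knowledge base such as WordNet, and `build_chains` and `grade_sentences` are hypothetical names.

```python
# Hypothetical relatedness table standing in for a lexical-semantic
# knowledge base; the entries are invented for illustration.
RELATED = {
    frozenset(("car", "vehicle")),
    frozenset(("car", "engine")),
    frozenset(("dog", "animal")),
}


def related(a, b):
    """Two words are related if they are identical or listed in RELATED."""
    return a == b or frozenset((a, b)) in RELATED


def build_chains(sentences):
    """Greedily attach each word to the first chain holding a related word.

    `sentences` is a list of word lists (assumed to be candidate nouns).
    Each chain is a list of (sentence_index, word) pairs.
    """
    chains = []
    for index, words in enumerate(sentences):
        for word in words:
            for chain in chains:
                if any(related(word, member) for _, member in chain):
                    chain.append((index, word))
                    break
            else:
                # No existing chain fits, so the word starts a new one.
                chains.append([(index, word)])
    return chains


def grade_sentences(sentences):
    """Score each sentence by the total length of the chains it takes part in."""
    scores = [0] * len(sentences)
    for chain in build_chains(sentences):
        for index in {i for i, _ in chain}:
            scores[index] += len(chain)
    return scores
```

In this sketch a sentence scores highly when its nouns belong to long chains, that is, when it touches topics that recur throughout the text.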

Contents of this package

Final release of the Studienprojekt "SumIt": ../sumit.tar.gz

SumIt as a Unix command-line utility: src/sumit.py

The tagging module: src/tagging.py

The tokenizer module: src/tokenizer.py

The lexical chain construction module: src/chainer.py

The summarizing module: src/summarizer.py

The data structure providing the finite state automaton: src/fsautomathon.py

The WordNet binding for Python: src/wordnet.c

The inline code documentation, generated with HappyDoc: doc/codedoc/codedoc.html

The main text-based documentation, including the evaluation: doc/index.html

Example texts that can be processed by SumIt: res/texts/

Resources for creating the finite state automaton: res/fsa