xon: Methods for automatic extraction of multi-word terms

authors: Bastian Bolender, Sebastian Kreß, Jannik Strötgen

Abstract

Larger terminologies typically contain a considerable portion of terms that may occur in free text in somewhat modified forms, due to orthographic, inflectional, syntactic or other variations. We describe a setup that combines symbolic heuristics with linguistic information to improve the identification of terms from free text. We present experiences with the approach in the framework of a finite-state transducer based information extraction system on biomedical documents using the well-known MeSH thesaurus.

Using existing large thesauri directly for automatic indexing of free text typically runs into the problem that many relevant terms do not appear in the documents in their thesaurus form. Often the string in the document can be regarded as a thesaurus term that has undergone some transformation through orthographic, inflectional, syntactic or other processes.

We examine typical cases of such variation that cause a naive matching process to miss the respective terms, and we suggest a typology of these cases. We then propose symbolic and linguistic clues to address the mismatches. For example, a multi-word noun phrase can often be found in different syntactic variations (lung cancer, cancer of the lung, etc.), and recognizing their equivalence requires access to linguistic information such as part of speech and morphology.
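The kind of mismatch described above can be illustrated with a minimal Python sketch. It is not the project's MultiWordTermFinder: the function names, the tiny term list, the "X of (the) Y" pattern and the crude plural stripping are all assumptions made for illustration, standing in for real part-of-speech and morphological information.

```python
import re

# Matches "HEAD of [the] MODIFIER", e.g. "cancer of the lung" (assumed pattern).
OF_PATTERN = re.compile(r"^(?P<head>\w+) of (?:the )?(?P<mod>\w+)$")


def singularize(word):
    """Crude stand-in for lemmatization: strip a final plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(phrase):
    """Handle orthographic and inflectional variation: case, hyphens, plurals."""
    tokens = re.sub(r"[-\s]+", " ", phrase.lower()).strip().split(" ")
    return " ".join(singularize(token) for token in tokens)


def candidate_keys(phrase):
    """Return the normalized phrase plus its syntactic permutation, if any."""
    key = normalize(phrase)
    keys = [key]
    match = OF_PATTERN.match(key)
    if match:
        # "cancer of the lung" -> "lung cancer"
        keys.append(f"{match.group('mod')} {match.group('head')}")
    return keys


def build_index(terms):
    """Map normalized keys to the original thesaurus terms."""
    return {normalize(term): term for term in terms}


def lookup(phrase, index):
    """Return the thesaurus term matching the phrase, or None."""
    for key in candidate_keys(phrase):
        if key in index:
            return index[key]
    return None


# Hypothetical mini-thesaurus; real entries would come from the MeSH descriptors.
index = build_index(["Lung Cancer", "Heart Diseases"])
print(lookup("cancers of the lungs", index))
print(lookup("Lung-cancer", index))
```

A naive string comparison would miss both phrases in the usage lines, while the normalized lookup maps them to the same thesaurus entry. The plural stripping is deliberately simplistic and would mangle words such as "cirrhosis", which is one reason proper morphological analysis is needed.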

The three main parts of our project are a linguistic approach, a quantitative approach and the evaluation of the two. In addition, the results are compared to our gold standard (the manual annotation of the PubMed files) and to an extraction from TEMIS Deutschland GmbH. The evaluation results can be reproduced with the provided scripts and are also included in the evaluation presentation.
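A comparison of extracted terms against a gold standard could look like the following sketch. The page does not say which measures the evaluation scripts compute, so the use of precision, recall and F1 here is an assumption, as are the function name and the example term sets.

```python
def evaluate(extracted, gold):
    """Compare extracted terms with gold-standard terms (set-based, exact match)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical term sets for a single document.
extracted_terms = {"lung cancer", "heart disease", "blood"}
gold_terms = {"lung cancer", "heart disease", "liver cirrhosis", "breast cancer"}

precision, recall, f1 = evaluate(extracted_terms, gold_terms)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same function could be applied to the output of the linguistic approach, the quantitative approach and the external extraction in turn, so that all three are scored against the same manual annotation.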

Source code

Multiword identification: linguistic MultiWordTermFinder, quantitative MultiWordTermFinder

Tools: create MeSH Lemma database, get_abstracts_and_titles, retrieve abstracts and titles from PubMed articles, medlineClean

Evaluation: evaluation

Documents

final presentations: linguistic, quantitative, evaluation

input: pubmed2004 (first 9998 documents), MeSH descriptors 2005, MeSH tree 2005

download project-archive

last update: 10.02.2006