1     PROJECT IDENTIFICATION

1.1  Project Title: Extraction of linguistic information from Penn Treebank

1.2  Acronym: Extraction

1.3  Keywords: reference tool, Perl, Penn Treebank

1.4  Duration: 9 months.

1.5  Intended Starting Date: January 2001

1.6  Participant list:

  Liu Lezhong

 

Description:

The tool is designed to be used in three applications:

generating training data for machine learning of co-reference relations, evaluating theories of referring expression generation and resolution in texts, and developing theories for understanding reference in dialogs. The need to mark any of a broad set of relations which may span several levels of discourse structure drives the system architecture. The system can collect statistics over encoded relations and measure inter-coder reliability, and includes tools to increase the accuracy of the user's markings by highlighting the discrepancies between two sets of markings. Using parsed corpora as the input further reduces the human workload and increases reliability.


Extracting Markables

 

In this context, a markable is a text span representing a discourse entity (DE) which can be anaphorically referred to in a text or dialog. The majority of markables are noun phrases. Because the Treebank is a fully parsed and well-defined representation of the text, it is trivial to determine the boundaries of all of the NP's in the text. However, the full set of NP's found by the Treebank parse is too inclusive for our purposes (i.e., it is a superset of the NP markables). While the Treebank delineates all NP's at all levels of embedding, it is not the case that each such NP contributes a distinct DE. Consider the following example containing three NP's in the parsed Treebank:


(1) (NP (NP different parts) (PP of (NP Europe)))

They want to mark both ``different parts of Europe'' and ``Europe'', since both contribute distinct DE's. However, notice that ``different parts'' does not contribute a DE, since it is not possible to refer to this subexpression alone in subsequent discourse. To avoid finding such undesirable NP's, our system has a heuristic (H1) which says: pass over any NP which is a leftmost child of a top-level NP. This heuristic is too drastic, though, eliminating constructions like (2).

(2) (NP (NP the inner brain) and (NP the eyes))

To avoid losing these examples, they include another heuristic (H2) which says: H1 does not apply when the NP is a sibling of another NP. A third heuristic must be added to overrule H1 in the case of a possessor in a possessive construction, such as:

(3) (NP (NP Chicago's) South Side)

where they should extract both ``Chicago'' and ``Chicago's South Side''. So, heuristic H3 is introduced: H1 does not apply when the NP is a possessive form. Even with heuristics eliminating the NP's which they do not need to consider, there are some NP's that will be found by the system which cannot be eliminated automatically. Copular constructions such as (4) introduce unnecessary NP's.

(4) John is a doctor.

``John'' and ``a doctor'' are syntactically NP's, but the second does not contribute a unique DE.

Also, idiomatic expressions such as (5) must be eliminated by hand:

(5) Ned kicked the bucket.

The syntactic NP ``the bucket'' refers to no DE and cannot be the antecedent of any future referring expression, so it should not be marked. At this time, they do not have a way for the expression extracting system to detect and avoid these examples. As a result, they must introduce a correction phase in which a human corrects the markings, eliminating those that are superfluous, and adjusting those that may have been mismarked. The goal is to have a set of expressions which is as close as possible to the set of expressions necessary and sufficient for the applications. For example, if there are many extraneous expressions in the machine learning task, they will act as distractors -- examples which decrease the accuracy of the learned model by diluting the highly correlative data with noise.
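The interaction of H1, H2, and H3 on examples (1)-(3) can be sketched as a tree filter. The project's implementation is in Perl; the following Python sketch is only an illustration, and the s-expression reader and the possessive test (last word ending in ``'s'') are assumptions, not the actual code:

```python
import re

def parse(sexp):
    """Read a Treebank-style bracketed parse into (label, children) tuples;
    leaves are plain word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", sexp)
    def walk(i):
        label = tokens[i + 1]          # tokens[i] is "("
        children, i = [], i + 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                node, i = walk(i)
            else:
                node, i = tokens[i], i + 1
            children.append(node)
        return (label, children), i + 1
    return walk(0)[0]

def words(node):
    """Flatten a node to its word strings."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def is_np(node):
    return not isinstance(node, str) and node[0] == "NP"

def markables(node, parent=None):
    """Collect candidate markable NP's: H1 skips the leftmost child NP of
    an NP, unless H2 (an NP sibling exists, as in coordination) or
    H3 (the NP is a possessive form) overrules it."""
    found = []
    if is_np(node):
        skip = False
        if parent is not None and is_np(parent):
            leftmost = parent[1][0] is node                  # H1 trigger
            has_np_sibling = sum(map(is_np, parent[1])) > 1  # H2
            possessive = words(node)[-1].endswith("'s")      # H3 (assumed test)
            skip = leftmost and not has_np_sibling and not possessive
        if not skip:
            found.append(" ".join(words(node)))
    if not isinstance(node, str):
        for child in node[1]:
            found.extend(markables(child, node))
    return found
```

On example (1) this keeps ``different parts of Europe'' and ``Europe'' but drops ``different parts''; on (2) the coordination keeps both conjunct NP's; on (3) the possessive is kept. The copular and idiomatic cases (4)-(5) are untouched by these rules, which is why the manual correction phase remains necessary.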


Extracting Features

 

In addition to yielding many of the markables themselves, the parsed corpora contain information from which many of the features can be automatically derived. Some features' values are marked explicitly in the corpus, while others can be extracted automatically by examining the tree structure. The simplest source of feature values is the Treebank ``functional tags''. For example, the grammatical function (syntactic subject, topicalization, logical subject of passives, etc.) of a phrase and its semantic role (vocative, location, manner, etc.) are marked in the corpus.

Other features must be found by walking the tree structure provided in the Treebank. The form of the NP (whether the NP is realized as a personal pronoun, demonstrative pronoun, or definite description) is a function of the part-of-speech tags assigned to the words in the NP. Whether the NP is definite, indefinite, or indeterminable depends on whether an article begins the NP. If the article is ``a'', ``an'', or ``some'', they assume the NP is indefinite. ``The'' indicates definiteness; otherwise, they assign a value of ``none'', which simply indicates that there is no simple way of classifying this instance. The case of an NP is usually determined by its position in the tree. Any child of a VP is marked as an ``object''. Children of PP's are marked ``prep-adjunct'' unless the PP was tagged ``PP-put'', which indicates that the PP acts as a complement to the verb; in this case they tag the NP as ``prep-complement''.
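The article-based definiteness rule and the position-based case rule are simple enough to state directly in code. A minimal Python sketch (the project itself uses Perl, and the function names and the ``none'' default for case are hypothetical):

```python
def definiteness(np_words):
    """Definiteness from the leading article, per the rule above:
    'a'/'an'/'some' -> indefinite, 'the' -> definite, else 'none'."""
    first = np_words[0].lower()
    if first in ("a", "an", "some"):
        return "indefinite"
    if first == "the":
        return "definite"
    return "none"   # no simple way to classify this instance

def case_of(parent_label):
    """Case from the NP's position in the tree: child of a VP -> object;
    child of a PP -> prep-adjunct, unless the PP is tagged PP-put,
    marking it a verb complement -> prep-complement."""
    if parent_label == "VP":
        return "object"
    if parent_label.startswith("PP"):
        return "prep-complement" if parent_label == "PP-put" else "prep-adjunct"
    return "none"   # hypothetical default; the text does not give one
```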

 

 

2.1  Project objective(s)

 

 

List the objectives of the project as precisely as possible, in phrases of the form "to investigate, to provide, etc.". Wherever possible, quantify the objectives. Justify the proposed research and development.


To examine the phenomenon of reference in discourse, and to analyze how discourse structure and reference interact, they need a tool which provides several kinds of functionality, including mark-up, visualization, and evaluation. Before designing such a tool, they must carefully analyze the kinds of information each application requires.

Three applications have driven the design of the system. These are: 1) the creation of training data for automatic derivation of reference resolution algorithms (i.e., machine learning), 2) the formation of a testbed for evaluating proposed reference generation and anaphora resolution theories, and 3) the development of theories about understanding reference in dialog.

 

2.2  Technical Baseline

 

This section should describe the state of the art in the area of research and development of the project. Explain how the proposed project will be innovative. When explaining the technical feasibility of the proposal, indicate where there are risks of not achieving the objectives.

 

All of the applications depend on having a corpus of reliably marked expressions, features, and relations. In order to determine that these dimensions have been ``reliably marked'', they need to measure agreement between two coders marking the same text. One way to increase the reliability of the coding (regardless of the method used to measure reliability) is to automate part of the coding process. Our system can extract a number of markings, features, and relations from the parsed, part-of-speech-tagged corpora of the type found in the Penn Treebank II (Marcus et al., 1994).

Use of the Treebank data means they can find most of the markables and many of the necessary features before giving the task to a human coder. They do not try to extract any of the co-reference information from the parsed corpora.
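The text does not name the agreement statistic used. Cohen's kappa is one standard chance-corrected measure for two coders labeling the same items, and is sketched here purely as an illustration of what "measuring inter-coder reliability" involves:

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders who labeled the same
    items (Cohen's kappa; an illustrative choice -- the measure actually
    used by the project is not specified in the text)."""
    n = len(coder_a)
    # Observed agreement: fraction of items where the coders match.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement: chance of a match given each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:   # both coders used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 indicates the markings are reliable; a kappa near 0 means the coders agree no more often than chance, which is exactly the situation the automated pre-marking is intended to prevent.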

 

 

2.3  Implementation

 

List the functionality of the system. Describe the project components. A graphical presentation of the project components may be useful. Describe the overall organization of work that will lead the participants to achieve the objectives of the project (investigations, prototypes, milestones, presentations).


They use mainly Perl.

The process is:

1) Penn Treebank --> extract all the NP's.

2) All the NP's --> the NP's they really need (according to Extracting Markables above).

3) Get the attribute information:

-- Semantic Role
-- NP Form
-- Case
-- NP Depth
-- Sentence Depth
-- Text
-- Grammatical Role
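As an illustration of the depth attributes in step 3, the embedding depth of each NP can be read directly off the bracketed parse by tracking parenthesis nesting. This is a Python sketch, not the project's Perl code, and the exact depth-counting convention is an assumption:

```python
import re

def np_depths(sexp):
    """Embedding depth (number of enclosing constituents) of every NP in a
    bracketed Treebank parse, reported left to right -- a sketch of the
    'NP Depth' attribute."""
    depth, depths = 0, []
    # Match either an opening bracket with its label, e.g. "(NP", or ")".
    for tok in re.findall(r"\([^\s()]+|\)", sexp):
        if tok == ")":
            depth -= 1
        else:
            if tok == "(NP":
                depths.append(depth)
            depth += 1
    return depths
```

The ``Sentence Depth'' attribute could be computed the same way by counting enclosing `(S` constituents instead of all constituents.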