1 PROJECT IDENTIFICATION
1.1 Project Title: Extraction of linguistic information from the Penn Treebank
1.2 Acronym: Extraction
1.3 Keywords: reference, tool, Perl, Penn Treebank
1.4 Duration: 9 months.
1.5 Intended Starting Date: January 2001
1.6 Participant list:
Liu Lezhong
Description:
The tool is designed to be used in three applications: generating training data for machine learning of coreference relations, evaluating theories of referring expression generation and resolution in texts, and developing theories for understanding reference in dialogs. The need to mark any of a broad set of relations which may span several levels of discourse structure drives the system architecture. The system has the ability to collect statistics over encoded relations and measure intercoder reliability, and includes tools to increase the accuracy of the user's markings by highlighting the discrepancies between two sets of markings. Using parsed corpora as the input further reduces the human workload and increases reliability.
Extracting Markables
In this context, a markable is a text span representing a discourse entity (DE) which can be anaphorically referred to in a text or dialog. The majority of markables are noun phrases. Because the Treebank is a fully parsed and well-defined representation of the text, it is trivial to determine the boundaries of all of the NP's in the text. However, the full set of NP's found by the Treebank parse is too inclusive for their purposes (i.e., it is a superset of the NP markables). While the Treebank delineates all NP's at all levels of embedding, it is not the case that each such NP contributes a distinct DE. Consider the following example containing three NP's in the parsed Treebank:
(1) (NP (NP different parts) (PP of (NP Europe)))
They want to mark both ``different parts of Europe'' and ``Europe'', but not the embedded ``different parts'' on its own; heuristic H1 therefore rules out embedded NP's of this kind. Applied on its own, however, H1 would also discard the coordinated NP's in (2):
(2) (NP (NP the inner brain) and (NP the eyes))
To avoid losing these examples, they include another heuristic (H2) which says: H1 does not apply when the NP is a sibling of another NP. A third heuristic (H3) must be added to overrule H1 in the case of a possessor in a possessive construction, such as:
(3) (NP (NP Chicago's) South Side)
where they should extract both the full NP (``Chicago's South Side'') and the possessor (``Chicago's'').
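As a rough Perl sketch (not the actual tool), the extraction step and the heuristics could look like the following. The bracket parser is a toy that ignores the POS and function tags of the real Treebank, and the reading of H1 as ``an NP nested directly inside another NP is skipped'' is an assumption here, since H1 itself is not quoted above.

#!/usr/bin/perl
# Hypothetical sketch: extract candidate markables (NP spans) from a
# simplified Treebank-style bracketing and apply heuristics H1-H3.
use strict;
use warnings;

# Parse "(LABEL child child ...)" into { label => ..., children => [...] }.
# Bare words become leaf strings; the POS and function tags of the real
# Treebank are not handled by this toy parser.
sub parse_tree {
    my ($text) = @_;
    my @tokens = $text =~ /\(|\)|[^\s()]+/g;
    my $pos = 0;
    my $parse;
    $parse = sub {
        $tokens[$pos++] eq '(' or die "expected '('";
        my %n = (label => $tokens[$pos++], children => []);
        while ($tokens[$pos] ne ')') {
            push @{ $n{children} },
                $tokens[$pos] eq '(' ? $parse->() : $tokens[$pos++];
        }
        $pos++;                                  # consume ')'
        return \%n;
    };
    return $parse->();
}

# All leaf words under a node, left to right.
sub words {
    my ($n) = @_;
    return ref $n ? map { words($_) } @{ $n->{children} } : $n;
}

# The H1-H3 test: keep an NP unless H1 applies and neither H2 nor H3 saves it.
sub is_markable {
    my ($np, $parent) = @_;
    return 1 unless $parent && $parent->{label} eq 'NP';      # H1 (assumed)
    return 1 if grep { ref $_ && $_->{label} eq 'NP' && $_ != $np }
                     @{ $parent->{children} };                # H2: NP sibling
    my @w = words($np);
    return $w[-1] =~ /'s$/ ? 1 : 0;                           # H3: possessor
}

# Collect the text of every markable NP in a tree.
sub collect_markables {
    my ($node, $parent, $out) = @_;
    return unless ref $node;
    push @$out, join ' ', words($node)
        if $node->{label} eq 'NP' && is_markable($node, $parent);
    collect_markables($_, $node, $out) for @{ $node->{children} };
}

# The three examples from the text.
for my $bracketing (
    "(NP (NP different parts) (PP of (NP Europe)))",
    "(NP (NP the inner brain) and (NP the eyes))",
    "(NP (NP Chicago's) South Side)",
) {
    my @markables;
    collect_markables(parse_tree($bracketing), undef, \@markables);
    print join(' | ', @markables), "\n";
}

On example (1) this keeps ``different parts of Europe'' and ``Europe'' but drops ``different parts'', which matches the behaviour described above.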
Apart from the NP's that these heuristics tell them they do not need to consider, there are some NP's that will be found by the system which cannot be eliminated automatically. Copular constructions such as (4) introduce unnecessary NP's.
(4) John is a doctor.
``John'' and ``a doctor'' are syntactically NP's, but the second does not contribute a unique DE.
Also, idiomatic expressions such as (5) must be eliminated by hand:
(5) Ned kicked the bucket.
The syntactic NP ``the bucket'' refers to no DE and cannot be the antecedent of any future referring expression, so it should not be marked. At this time, they do not have a way for the expression-extracting system to detect and avoid these examples. As a result, they must introduce a correction phase in which a human corrects the markings, eliminating those that are superfluous and adjusting those that may have been mismarked. The goal is to have a set of expressions which is as close as possible to the set of expressions necessary and sufficient for the applications. For example, if there are many extraneous expressions in the machine learning task, they will act as distractors: examples which decrease the accuracy of the learned model by diluting the highly correlative data with noise.
Extracting Features
In addition to providing many of the markables themselves, the parsed corpora contain information from which many of the features can be automatically derived. Some features' values are marked explicitly in the corpus, while others can be automatically extracted by examining the tree structure. The simplest source of feature values is the Treebank ``functional tags''. For example, the grammatical function (syntactic subject, topicalization, logical subject of passives, etc.) of phrases and the semantic role (vocative, location, manner, etc.) are marked in the corpus.
Other features must be found by walking the tree structure provided in the Treebank. The form of the NP (whether the NP is realized as a personal pronoun, demonstrative pronoun, or definite description) is a function of the part-of-speech tags assigned to the words in the NP. Whether the NP is definite, indefinite, or indeterminable depends on whether an article begins the NP. If the article is ``a'', ``an'', or ``some'', they assume the NP is indefinite. ``The'' indicates definiteness; otherwise, they assign a value of ``none'', which simply indicates that there is no simple way of classifying this instance. The case of an NP is usually determined by its position in the tree. Any child of a VP is marked as an ``object''. Children of PP's are marked ``prep-adjunct'' unless the PP was tagged ``PP-put'', which indicates that the PP acts as a complement to the verb. In this case they tag the NP as ``prep-complement''.
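The article and tree-position rules just described are concrete enough to sketch in code. The following hypothetical Perl fragment reuses the toy tree nodes and the words helper from the extraction sketch above; the ``PP-put'' label and the value names follow the description here, but the exact spellings used in the real system are not shown in this excerpt.

# Hypothetical sketch of two of the feature rules described above, applied
# to the toy tree nodes ({ label => ..., children => [...] }) and the
# words() helper from the extraction sketch.
use strict;
use warnings;

# Definiteness: decided by the article, if any, that begins the NP.
sub definiteness {
    my ($np) = @_;
    my ($first) = words($np);                   # first word of the NP
    return 'indefinite' if $first =~ /^(?:a|an|some)$/i;
    return 'definite'   if lc($first) eq 'the';
    return 'none';                              # no simple classification
}

# Case: decided by the NP's position in the tree.
sub np_case {
    my ($np, $parent) = @_;
    return 'none'   unless $parent;
    return 'object' if $parent->{label} eq 'VP';    # any child of a VP
    if ($parent->{label} =~ /^PP/) {
        # A PP tagged "PP-put" acts as a complement of the verb.
        return $parent->{label} eq 'PP-put' ? 'prep-complement'
                                            : 'prep-adjunct';
    }
    return 'none';
}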
2.1 Project objective(s)
List the objectives of the project as precisely as possible, in phrases of the form ``to investigate'', ``to provide'', etc. Wherever possible, quantify the objectives. Justify the proposed research and development.
To examine the phenomenon of reference in discourse, and to analyze how discourse structure and reference interact, they need a tool which allows several kinds of functionality, including markup, visualization, and evaluation. Before designing such a tool, they must carefully analyze the kinds of information each application requires.
Three applications have driven the design of the system. These are: 1) the creation of training data for automatic derivation of reference resolution algorithms (i.e., machine learning), 2) the formation of a testbed for evaluating proposed reference generation and anaphora resolution theories, and 3) the development of theories about understanding reference in dialog.
2.2 Technical Baseline
This section should describe the state of the art in the area of research and development of the project. Explain how the proposed project will be innovative. When explaining the technical feasibility of the proposal, indicate where there are risks of not achieving the objectives.
All of the applications depend on having a corpus of reliably marked expressions, features, and relations. In order to determine that these dimensions have been ``reliably marked'', they need to measure agreement between two coders marking the same text. One way to increase the reliability of the coding (regardless of the method used to measure reliability) is to automate part of the coding process. The system can extract a number of markings, features, and relations from parsed, part-of-speech-tagged corpora of the type found in the Penn Treebank 2 (Marcus et al., 1994).
Use of the Treebank data means they can find most of the markables and many of the necessary features before giving the task to a human coder. They do not try to extract any of the coreference information from the parsed corpora.
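The excerpt does not say which agreement measure is used. Cohen's kappa is a common choice when two coders assign one category to each of the same set of items, so a minimal Perl sketch of that measure is given here purely as an illustration; the example labels are invented.

# Hypothetical sketch: Cohen's kappa over two coders' category labels.
use strict;
use warnings;
use List::Util qw(sum);

sub cohen_kappa {
    my ($c1, $c2) = @_;              # two array refs of category labels
    die "coders marked different numbers of items\n" unless @$c1 == @$c2;
    my $n = @$c1;
    my (%count1, %count2);
    my $agree = 0;
    for my $i (0 .. $n - 1) {
        $count1{ $c1->[$i] }++;
        $count2{ $c2->[$i] }++;
        $agree++ if $c1->[$i] eq $c2->[$i];
    }
    my $p_o = $agree / $n;                                  # observed agreement
    my $p_e = sum(0, map { $count1{$_} * ($count2{$_} // 0) }
                     keys %count1) / ($n * $n);             # chance agreement
    return 1 if $p_e == 1;                                  # degenerate case
    return ($p_o - $p_e) / (1 - $p_e);
}

# Example: two coders labelling the same six NP's for definiteness.
my @coder1 = qw(definite definite indefinite none definite indefinite);
my @coder2 = qw(definite indefinite indefinite none definite definite);
printf "kappa = %.2f\n", cohen_kappa(\@coder1, \@coder2);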
2.3 Implementation
List the functionality of the system. Describe the project components. A graphical presentation of the project components may be useful. Describe the overall organization of work that will lead the participants to achieve the objectives of the project (investigations, prototypes, milestones, presentations).
They use mainly Perl. The process is as follows (a rough Perl sketch is given after the list):
1) Penn Treebank --> extract all the NP's.
2) All the NP's --> the NP's they really need (according to ``Extracting Markables'' above).
3) Get the attribute information for each NP:
-- Semantic Role
-- NP Form
-- Case
-- NP Depth
-- Sentence Depth
-- Text
-- Grammatical Role
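Purely as an illustration, the three steps could be chained as below, reusing parse_tree, words, is_markable, definiteness and np_case from the sketches above. The hard-coded input and the tab-separated output (Text, Case, definiteness, NP Depth) are assumptions; a real run would read the bracketings from the Treebank files in step 1 and emit the full attribute list.

# Hypothetical end-to-end driver for steps 1-3.
use strict;
use warnings;

# Walk a tree, keeping track of each NP's parent and depth so that the
# position-dependent attributes can be computed (steps 2 and 3).
sub emit_markables {
    my ($node, $parent, $depth) = @_;
    return unless ref $node;
    if ($node->{label} eq 'NP' && is_markable($node, $parent)) {
        print join("\t",
            join(' ', words($node)),     # Text
            np_case($node, $parent),     # Case
            definiteness($node),         # definite / indefinite / none
            $depth,                      # NP Depth
        ), "\n";
    }
    emit_markables($_, $node, $depth + 1) for @{ $node->{children} };
}

# Step 1 would read bracketings from the Treebank; here we reuse the examples.
for my $bracketing (
    "(NP (NP different parts) (PP of (NP Europe)))",
    "(NP (NP Chicago's) South Side)",
) {
    emit_markables(parse_tree($bracketing), undef, 0);
}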