Final report

A. Short description of the function of each subroutine

1. preprocessing()
   input: raw Penn Treebank file
   output: prefix.new01, containing each sentence with just one newline at the end
   function: removes unnecessary whitespace and newlines within every sentence, and removes the explanations at the beginning of the file.

2. puretext()
   input: prefix.new01, containing each sentence with just one newline at the end
   output: the pure text, but still with the 's 'm n't 'd 're problems
   function: extracts all the terminals (the text) out of the Treebank.

3. sproblem()
   input: prefix.withs
   output: prefix.pure, without the 's 'm n't 'd 're problems
   function: cleans up the 's 'm n't 'd 're problems.

4. terminalcount()
   input: prefix.withs
   output: prefix.pure, without the 's 'm n't 'd 're problems
   function: counts the positions of the terminals.

5. phasenprint()
   input: prefix.new01
   output: all the phrases with their position information
   function: prints all the phrases with their position information.

6. goback()
   input: prefix.phrase and prefix.npdepth
   output: the phrases without the terminal position information; the npdepth information with phrase indexes
   function: removes all the terminal position information so that the recursive program can later be used to extract the real entities; also removes the additional information in infile.NPDEPTH.

7. subknote()
   input: $infile.back
   output: a list of the indexes of the phrases we need
   function: For example:

   (NP (NP different parts) (PP of (NP Europe)))

   They want to mark both "different parts of Europe" and "Europe", since they both contribute distinct DE's. However, notice that "different parts" does not contribute a DE, since it is not possible to refer to this subexpression alone in subsequent discourse. To avoid finding such undesirable NP's, our system has a heuristic (H1) which says: Pass over any NP which is a leftmost child of a top-level NP. This heuristic is too drastic, though, eliminating constructions like (2).
   (2) (NP (NP the inner brain) and (NP the eyes))

   To avoid losing these examples, they include another heuristic (H2) which says: H1 does not apply when the NP is a sibling of another NP. A third heuristic must be added to overrule H1 in the case of a possessor in a possessive construction, such as:

   (3) (NP (NP Chicago's) South Side)

   where they should extract both "Chicago" and "Chicago's South Side". So the heuristic H3 is introduced: H1 does not apply when the NP is a possessive form. Even with heuristics eliminating the NP's which they do not need to consider, there are some NP's that will be found by the system which cannot be eliminated automatically. Copular constructions such as (4) introduce unnecessary NP's.

8. npphrase()
   input: prefix.phrase
   output: the NP phrases, each with its begin and end positions added at the end
   function: extracts the NP phrases from all the phrases, determines the begin and end positions of each phrase we need, and adds them at the end of that phrase.

9. npprad()
   input: infile.phrasenp
   output: the NP phrases without NP-PRD; list1 holds the indexes of the phrases we don't need; list3 must be processed later
   function: gets rid of the NPs in NP-PRD that we don't need and produces list1, which contains the indexes of the phrases we don't need.

Workpackage 4: some exceptions and tests

10. nppad2()
   input: list3 and all the NP phrases
   output: list4, containing the indexes of the phrases we don't need
   function: for an NP-PRD, gets rid of all the child nodes of the first subnode; in addition, if there is a ", ," in the NP-PRD, the first node after it must also be deleted.

11. final()
   input: list1, list4 and the NP phrases
   output: the realnp we need
   function: gets rid of all the phrases we don't need.

12. nummer()
   input: cf02.realnp, cf02.npdepth2
   output: .cf02.info and .cf02.attr

B. The files cf02.pure, .cf02.info and .cf02.attr are the input for another tool, which is called mate. The result of the project is almost the same (99.3%) as that of another program written in C.
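
The heuristics H1 to H3 described under subknote() can be illustrated with a minimal sketch in Python. It is not the project's code. It assumes the simplified bracketing used in examples (1) to (3), where terminals appear as bare words; it reads "top-level NP" as "the parent NP"; and it detects a possessive only by a trailing 's or ' on the last word. The names parse, words, is_np and select_nps are hypothetical helpers introduced for this illustration.

```python
import re


def parse(s):
    """Parse a bracketed string into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # bare word (terminal)
                pos += 1
        pos += 1                      # skip ")"
        return [label] + children

    return node()


def words(tree):
    """Return all terminal words below a node, in order."""
    out = []
    for child in tree[1:]:
        if isinstance(child, list):
            out.extend(words(child))
        else:
            out.append(child)
    return out


def is_np(x):
    """True for NP nodes; a function tag such as NP-SBJ is ignored (assumption)."""
    return isinstance(x, list) and x[0].split("-")[0] == "NP"


def select_nps(tree, parent=None, out=None):
    """Collect the text of every NP that survives H1, as relaxed by H2 and H3."""
    if out is None:
        out = []
    if is_np(tree):
        skip = False
        # H1: the NP is the leftmost child of its parent NP.
        if parent is not None and is_np(parent) and parent[1] is tree:
            # H2: H1 does not apply if another NP is a sibling.
            has_np_sibling = any(is_np(c) and c is not tree for c in parent[1:])
            # H3: H1 does not apply to a possessive form (assumed marker: 's or ').
            w = words(tree)
            possessive = bool(w) and w[-1].endswith(("'s", "'"))
            skip = not (has_np_sibling or possessive)
        if not skip:
            out.append(" ".join(words(tree)))
    for child in tree[1:]:
        if isinstance(child, list):
            select_nps(child, tree, out)
    return out


if __name__ == "__main__":
    for example in ["(NP (NP different parts) (PP of (NP Europe)))",
                    "(NP (NP the inner brain) and (NP the eyes))",
                    "(NP (NP Chicago's) South Side)"]:
        print(select_nps(parse(example)))
```

On example (1) the sketch keeps "different parts of Europe" and "Europe" and passes over "different parts"; on (2) and (3) it keeps the inner NPs because H2 and H3 overrule H1. The possessive NP is returned as written ("Chicago's"); stripping the possessive marker is left out of the sketch.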