
Multilingual resource creation
Cross-lingual resource induction using parallel corpora
Manual creation of high-quality language resources (treebanks, lexicons) and NLP tools (such as taggers and parsers) is tedious and costly, and becomes prohibitive when it has to be repeated for many languages. Annotation projection exploits word-aligned parallel corpora to induce target-language annotations, using automatically assigned source-language annotations as a basis.
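The core idea of annotation projection can be illustrated with a minimal sketch. It assumes token-level labels on the source side and a word alignment given as a set of (source index, target index) pairs. The function name `project_labels`, the toy sentence pair, and the conflict rule (the first label to reach a target token wins) are illustrative assumptions, not the procedure used in the cited work.

```python
from typing import List, Optional, Set, Tuple


def project_labels(
    source_labels: List[Optional[str]],
    alignment: Set[Tuple[int, int]],
    target_length: int,
) -> List[Optional[str]]:
    """Copy each source token's label to the target tokens aligned with it."""
    target_labels: List[Optional[str]] = [None] * target_length
    # Iterate in sorted order so the result is deterministic.
    for src_idx, tgt_idx in sorted(alignment):
        label = source_labels[src_idx]
        # Assumption: the first label to reach a target token is kept;
        # unaligned target tokens stay unlabeled (None).
        if label is not None and target_labels[tgt_idx] is None:
            target_labels[tgt_idx] = label
    return target_labels


# Hypothetical toy example: English source, German target.
english = ["He", "arrived", "yesterday"]
german = ["Er", "kam", "gestern", "an"]
labels = [None, "EVENT", "TIMEX3"]  # automatically assigned source labels
alignment = {(0, 0), (1, 1), (1, 3), (2, 2)}  # "arrived" aligns to "kam ... an"

projected = project_labels(labels, alignment, len(german))
print(list(zip(german, projected)))
```

Real word alignments are noisy, so projected annotations are usually filtered or used as training data for a target-language tool rather than taken at face value.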
We have applied this technique to induce a temporal labeler for German by projecting TimeML markup from English annotations created with the TARSQI TimeML labeling toolkit. We further induced an f-structure bank for Polish by projecting English LFG f-structure information onto a word-aligned Polish section of the JRC-Acquis corpus.
- We have applied this technique to induce a temporal labeler for TimeML annotations in German, projecting the English annotations produced by the TARSQI TimeML labeling toolkit across aligned English-German Europarl texts (Spreyer, 2007; Spreyer and Frank, 2008).
- We induced grammatical function information for Polish by projecting English LFG f-structure information onto a word-aligned Polish section of the JRC-Acquis corpus (Wróblewska and Frank, 2009). The resulting Polish f-structure bank will now be used to train a dependency parser for Polish. In further work we will investigate the generation of full-fledged LFG c- and f-structure treebanks, using the model of Klein (2008) to generate c-structures from f-structures in a supervised learning task with minimal amounts of training data.
- Ultimately, the combination of these techniques will enable us to circumvent extensive manual annotation efforts in building high-quality treebanks as a basis for corpus-based grammar induction (cf. work on LFG-based grammar induction, from Frank et al. (2003) to, e.g., Cahill et al. (2005)).
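The projection of grammatical function information mentioned in the list above can be sketched in the same style. As a strong simplification, f-structure information is reduced here to (head, function, dependent) triples over token indices, and the alignment is assumed to be one-to-one. The `Edge` type, the function `project_edges`, and the example sentence pair are hypothetical and not taken from the cited work.

```python
from typing import Dict, List, Tuple

# (head index, grammatical function, dependent index)
Edge = Tuple[int, str, int]


def project_edges(source_edges: List[Edge], alignment: Dict[int, int]) -> List[Edge]:
    """Transfer an edge only if both its head and its dependent are aligned."""
    projected: List[Edge] = []
    for head, function, dependent in source_edges:
        if head in alignment and dependent in alignment:
            projected.append((alignment[head], function, alignment[dependent]))
    return projected


# Hypothetical toy example: English source, Polish target.
english = ["The", "Commission", "adopted", "the", "regulation"]
polish = ["Komisja", "przyjęła", "rozporządzenie"]
source_edges = [(2, "SUBJ", 1), (2, "OBJ", 4), (1, "SPEC", 0), (4, "SPEC", 3)]
# Assumed one-to-one alignment; the English articles have no Polish counterpart.
alignment = {1: 0, 2: 1, 4: 2}

target_edges = project_edges(source_edges, alignment)
for head, function, dependent in target_edges:
    print(f"{function}({polish[head]}, {polish[dependent]})")
```

Dropping edges with an unaligned endpoint trades coverage for precision: the Polish side receives only relations that are fully supported by the alignment.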