The Web as an Implicit Training Set: Application to Noun Compound Syntax and
Semantics 

Preslav Nakov
School of Computing, National University of Singapore
http://www.comp.nus.edu.sg/~nakov/

I will present Web-based approaches to
the syntax and semantics of noun compounds (NCs),
which can be used in query parsing, technical term understanding, etc.
I will also describe an application to machine translation.

First, I will present a highly accurate, lightly supervised method
based on surface features and paraphrases for
making bracketing decisions for three-word noun compounds,
e.g. "[[liver cell] antibody]" is left-bracketed,
while "[liver [cell line]]" is right-bracketed.
The enormous size of the Web makes such features
frequent enough to be useful.
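To make the bracketing decision concrete, here is a minimal sketch in Python. It is not the method of the talk, which combines surface features and paraphrases; it only shows a simple adjacency-style comparison of bigram frequencies. The function name and the counts in TOY_COUNTS are invented for illustration and do not come from any real Web query.

```python
# Hypothetical bigram counts standing in for Web frequencies (made-up numbers).
TOY_COUNTS = {
    ("liver", "cell"): 900,
    ("cell", "antibody"): 40,
    ("cell", "line"): 5000,
}

def bracket(w1, w2, w3, counts=TOY_COUNTS):
    """Adjacency-style decision for a three-word noun compound.

    Compares how often (w1 w2) and (w2 w3) occur as bigrams and
    prefers the bracketing that groups the more frequent pair.
    """
    left = counts.get((w1, w2), 0)   # evidence for [[w1 w2] w3]
    right = counts.get((w2, w3), 0)  # evidence for [w1 [w2 w3]]
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"

print(bracket("liver", "cell", "antibody"))
print(bracket("liver", "cell", "line"))
```

With these toy counts the two examples above come out left-bracketed and right-bracketed respectively; with real Web counts the outcome would depend on the frequencies actually observed.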

Then, I will introduce an unsupervised method
for discovering the implicit predicates characterizing
the semantic relations that hold in noun-noun compounds.
For example, "malaria mosquito" is a
"mosquito that carries/spreads/causes/transmits/brings/infects with/...
malaria".
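
The idea of reading predicates off paraphrases can be sketched as follows. This is an illustrative assumption, not the procedure described in the talk: it matches the pattern "HEAD that/which VERB MODIFIER" with a regular expression over a handful of invented sentences standing in for Web snippets, and it does no lemmatization or parsing.

```python
import re
from collections import Counter

def extract_predicates(modifier, head, snippets):
    """Count the words linking head and modifier in 'head that ... modifier'."""
    pattern = re.compile(
        rf"\b{re.escape(head)}(?:es|s)?\s+(?:that|which)\s+"
        rf"(\w+(?:\s+\w+)?)\s+{re.escape(modifier)}\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            # The captured group is one or two words, e.g. "carry" or "infects with".
            counts[match.group(1).lower()] += 1
    return counts

# Invented example sentences (not real Web data).
SNIPPETS = [
    "Mosquitoes that carry malaria are found in the tropics.",
    "A mosquito that transmits malaria bites at night.",
    "Most mosquitoes that carry malaria breed in standing water.",
    "A mosquito that infects with malaria must first be infected itself.",
]

predicates = extract_predicates("malaria", "mosquito", SNIPPETS)
print(predicates.most_common())
```

A realistic version would need to normalize verb forms (carry/carries) and filter noise; the sketch only shows how paraphrase patterns can surface candidate predicates.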

Finally, I will present a method for improving Statistical Machine Translation (SMT).
Most modern SMT systems rely on sentence-aligned bilingual corpora
for training. I will describe a method for expanding the training set
with conceptually similar but syntactically differing paraphrases
at the NP-level which involve NCs. The English-to-Spanish evaluation
on the Europarl corpus shows an improvement equivalent to 33%-50%
of the gain obtained by doubling the amount of training data.