Web N-Grams as a Resource for Corpus Linguistics (Stefan Evert, Technische Universität Darmstadt)

In recent years, the rapid growth of the World Wide Web has enabled research in computational linguistics to scale up to Web-derived corpora a thousand times the size of the British National Corpus (BNC) and more. These huge text collections open up entirely new possibilities for training statistical models and unsupervised learning algorithms. With the release of Google's Web 1T 5-gram database (Brants & Franz 2006), a corpus on the teraword scale came within reach of the general research community for the first time, in the form of n-gram frequency tables. Since then, the Web1T5 database has been applied to a wide range of natural language processing tasks. In addition to its obvious use as training data for broad-coverage n-gram models (e.g. as part of a machine translation or speech recognition system), the database has been used for spelling correction, as a convenient replacement for online Web queries, e.g. in knowledge mining, and even for the prediction of fMRI neural activation associated with concrete nouns (Mitchell et al. 2008). Computer scientists have also developed specialised indexing engines that allow fast interactive queries to the database, impressively demonstrated e.g. by http://www.netspeak.org/ (Stein et al. 2010).

In my talk, I explore the usefulness of Web1T5 and similar n-gram databases as a resource for corpus-linguistic studies, despite their well-known shortcomings: the inevitable frequency thresholds, a genre composition dominated by computer science, porn and advertising, an abundance of text duplicates and boilerplate, as well as a complete lack of linguistic annotation (lemmatisation and part-of-speech tagging). As an example, I show how three essential types of corpus analysis -- word and phrase frequencies, collocational profiles, and distributional semantics -- can be carried out on Web1T5. A prerequisite for more widespread adoption of n-gram databases in corpus linguistics is the availability of open-source indexing software that is flexible enough to support these types of corpus analysis, fast enough for interactive exploration of the database, and able to run on off-the-shelf desktop hardware. I present a simple and convenient solution building on SQLite (an embedded relational database engine), Perl and the statistical software package R (Evert 2010); an illustrative query is sketched after the references below.

The last part of my talk attempts an evaluation of Web1T5 as a linguistic resource. For this purpose, frequency counts for words and n-grams are compared with the BNC and other standard corpora, and Web1T5 is applied to several collocation extraction and semantic similarity tasks. A closer look at the evaluation results reveals some fundamental differences between a Web-based n-gram database and traditional corpora. In this way, I hope to shed new light on the question of whether more data are really always better data (Church & Mercer 1993).

REFERENCES

Brants, Thorsten and Franz, Alex (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia, PA. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13.

Church, Kenneth W. and Mercer, Robert L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1-24.

Evert, Stefan (2010). Google Web 1T5 n-grams made easy (but not for the computer). In Proceedings of the 6th Web as Corpus Workshop (WAC-6), Los Angeles, CA.
Mitchell, Tom M.; Shinkareva, Svetlana V.; Carlson, Andrew; Chang, Kai-Min; Malave, Vicente L.; Mason, Robert A.; Just, Marcel Adam (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191-1195.

Stein, Benno; Potthast, Martin; Trenkmann, Martin (2010). Retrieving customary Web language to assist writers. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. M. Rüger, and K. van Rijsbergen (eds.), Advances in Information Retrieval: 32nd European Conference on Information Retrieval (ECIR ’10), volume 5993 of Lecture Notes in Computer Science, pages 631-635. Springer, Berlin, Heidelberg, New York.
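
ILLUSTRATIVE SKETCH

To make the query interface described above concrete, the following R fragment sketches how a collocational profile might be retrieved from an SQLite index of the Web1T5 2-grams via the DBI/RSQLite packages. It is a minimal illustration only, not the actual web1t5-easy implementation (Evert 2010): the database file name and the table layout bigrams(w1, w2, f) are assumptions made for the example.

    # assumed: an SQLite file holding a table bigrams(w1, w2, f)
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "web1t5_2grams.sqlite")

    # collocates in the slot following the node word "light",
    # with their observed bigram frequencies
    colloc <- dbGetQuery(con,
      "SELECT w2 AS collocate, f FROM bigrams
       WHERE w1 = ? ORDER BY f DESC LIMIT 20",
      params = list("light"))

    # marginal frequencies and sample size from the bigram table,
    # needed to compute an association measure
    N  <- dbGetQuery(con, "SELECT SUM(f) AS N FROM bigrams")$N
    f1 <- dbGetQuery(con, "SELECT SUM(f) AS f1 FROM bigrams WHERE w1 = ?",
                     params = list("light"))$f1
    f2 <- sapply(colloc$collocate, function(w)
      dbGetQuery(con, "SELECT SUM(f) AS f2 FROM bigrams WHERE w2 = ?",
                 params = list(w))$f2)

    # simple association score: pointwise mutual information,
    # log2 of observed vs. expected bigram frequency
    colloc$PMI <- log2(colloc$f * N / (f1 * f2))
    print(colloc[order(colloc$PMI, decreasing = TRUE), ])

    dbDisconnect(con)

The division of labour follows the approach described in the abstract: the large frequency tables stay in SQLite, and only small result sets are transferred to R, where association scores (or distributional similarity measures) are computed.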