Web N-Grams as a Resource for Corpus Linguistics (Stefan Evert, Technische Universität Darmstadt)

In recent years, the rapid growth of the World Wide Web has enabled research in computational linguistics to scale up to Web-derived corpora a thousand times the size of the British National Corpus (BNC) and more. These huge text collections open up entirely new possibilities for training statistical models and unsupervised learning algorithms. With the release of Google's Web 1T 5-gram database (Brants & Franz 2006), a corpus on the teraword scale came within reach of the general research community for the first time, in the form of n-gram frequency tables. Since then, the Web1T5 database has been applied to a wide range of natural language processing tasks. In addition to its obvious use as training data for broad-coverage n-gram models (e.g. as part of a machine translation or speech recognition system), the database has been used for spelling correction, as a convenient replacement for online Web queries, e.g. in knowledge mining, and even for the prediction of fMRI neural activation associated with concrete nouns (Mitchell et al. 2008). Computer scientists have also developed specialised indexing engines that allow fast interactive queries to the database, impressively demonstrated e.g. by http://www.netspeak.org/ (Stein et al. 2010).

In my talk, I explore the usefulness of Web1T5 and similar n-gram databases as a resource for corpus-linguistic studies, despite their well-known shortcomings: the inevitable frequency thresholds, a genre composition dominated by computer science, porn and advertising, an abundance of text duplicates and boilerplate, as well as a complete lack of linguistic annotation (lemmatisation and part-of-speech tagging). As an example, I show how three essential types of corpus analysis -- word and phrase frequencies, collocational profiles, and distributional semantics -- can be carried out on Web1T5. A prerequisite for more widespread adoption of n-gram databases in corpus linguistics is the availability of open-source indexing software that is flexible enough to support these types of corpus analysis, fast enough for interactive exploration of the database, and able to run on off-the-shelf desktop hardware. I present a simple and convenient solution building on SQLite (an embedded relational database engine), Perl and the statistical software package R (Evert 2010); an illustrative query is sketched after the references below.

The last part of my talk attempts an evaluation of Web1T5 as a linguistic resource. For this purpose, frequency counts for words and n-grams are compared with the BNC and other standard corpora, and Web1T5 is applied to several collocation extraction and semantic similarity tasks. A closer look at the evaluation results reveals some fundamental differences between a Web-based n-gram database and traditional corpora. In this way, I hope to shed new light on the question of whether more data are really always better data (Church & Mercer 1993).

REFERENCES

Brants, Thorsten and Franz, Alex (2006). Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia, PA. http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13.

Church, Kenneth W. and Mercer, Robert L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1-24.

Evert, Stefan (2010). Google Web 1T5 n-grams made easy (but not for the computer). In Proceedings of the 6th Web as Corpus Workshop (WAC-6), Los Angeles, CA.
Mitchell, Tom M.; Shinkareva, Svetlana V.; Carlson, Andrew; Chang, Kai-Min; Malave, Vicente L.; Mason, Robert A.; Just, Marcel Adam (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191-1195.

Stein, Benno; Potthast, Martin; Trenkmann, Martin (2010). Retrieving customary Web language to assist writers. In C. Gurrin, Y. He, G. Kazai, U. Kruschwitz, S. Little, T. Roelleke, S. M. Rüger, and K. van Rijsbergen (eds.), Advances in Information Retrieval: 32nd European Conference on Information Retrieval (ECIR ’10), volume 5993 of Lecture Notes in Computer Science, pages 631-635. Springer, Berlin, Heidelberg, New York.
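
ILLUSTRATIVE SKETCH

To make the query interface described above concrete, the following R fragment sketches how a collocational profile might be retrieved from an SQLite index of the Web1T5 2-grams via the DBI/RSQLite packages. It is a minimal illustration only, not the actual web1t5-easy implementation (Evert 2010): the database file name and the table layout bigrams(w1, w2, f) are assumptions made for the example.

    # assumed: an SQLite file holding a table bigrams(w1, w2, f)
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "web1t5_2grams.sqlite")

    # collocates in the slot following the node word "light",
    # with their observed bigram frequencies
    colloc <- dbGetQuery(con,
      "SELECT w2 AS collocate, f FROM bigrams
       WHERE w1 = ? ORDER BY f DESC LIMIT 20",
      params = list("light"))

    # marginal frequencies and sample size from the bigram table,
    # needed to compute an association measure
    N  <- dbGetQuery(con, "SELECT SUM(f) AS N FROM bigrams")$N
    f1 <- dbGetQuery(con, "SELECT SUM(f) AS f1 FROM bigrams WHERE w1 = ?",
                     params = list("light"))$f1
    f2 <- sapply(colloc$collocate, function(w)
      dbGetQuery(con, "SELECT SUM(f) AS f2 FROM bigrams WHERE w2 = ?",
                 params = list(w))$f2)

    # simple association score: pointwise mutual information,
    # log2 of observed vs. expected bigram frequency
    colloc$PMI <- log2(colloc$f * N / (f1 * f2))
    print(colloc[order(colloc$PMI, decreasing = TRUE), ])

    dbDisconnect(con)

The division of labour follows the approach described in the abstract: the large frequency tables stay in SQLite, and only small result sets are transferred to R, where association scores (or distributional similarity measures) are computed.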