Corpora for the coming decade
Adam Kilgarriff

Corpora now play a central role in most of NLP and in various other branches of linguistics, yet we know very little about how to talk about them. Which corpora are similar to which others, and how can we measure that similarity? Domains often stand in hierarchical relations to each other, and corpora represent domains, so can corpora also stand in hierarchical relations to one another? We often want to adapt an NLP system to a new domain: can we use corpora of the two domains to estimate how much work that will be?

I first asked these questions 15 years ago, when so few corpora were available that the questions were largely academic; researchers rarely had any choice about which corpus to use. Now we can gather corpora of virtually unlimited size, covering all manner of text types, from the web, so the questions are practical and immediate.

I will describe my earlier work on corpus comparison and how we are currently extending it. I will also present an overview of how we currently build corpora, and of how we think we can make them bigger and better (and what that means) over the next few years. The work has made extensive use of the Sketch Engine, and the talk will include a Sketch Engine demo.