WikiCLIR: A Cross-Lingual Retrieval Dataset from Wikipedia

WikiCLIR is a large-scale (German-English) retrieval data set for Cross-Language Information Retrieval (CLIR). It contains a total of 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments for 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural language sentences that allow large-scale training of (translation-based) ranking models.

Terms of Use

WikiCLIR is licensed under a Creative Commons BY-SA 4.0 Unported License.

If you use the corpus in your work, please cite: (Schamoni et al., 2014).

Data

The corpus contains training, development and testing subsets randomly split on the query level.

Relevance judgments are constructed from the inter-language links between German and English Wikipedia articles. Relevance level 3 is assigned to the English cross-lingual mate of the German article, and level 2 to every other English article that both links to the mate and is linked from the mate. The intuition behind level 2 is that articles in a bidirectional link relation with the mate are likely either to define similar concepts or to be instances of the concept defined by the mate.
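The labelling rule can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the function name `relevance_levels`, the `outlinks` dictionary (article to the set of articles it links to) and the toy article names are all assumptions made for illustration.

```python
from typing import Dict, Set


def relevance_levels(mate: str, outlinks: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign relevance levels for one query, given its English mate.

    `outlinks` maps an English article to the set of articles it links to
    (a hypothetical in-memory representation of the link graph).
    """
    judgments = {mate: 3}  # the cross-lingual mate gets level 3
    for candidate in outlinks.get(mate, set()):
        # Level 2 requires links in both directions: mate -> candidate
        # (guaranteed by the loop) and candidate -> mate.
        if candidate != mate and mate in outlinks.get(candidate, set()):
            judgments[candidate] = 2
    return judgments


# Toy link graph with made-up article names.
toy_links = {
    "Physics": {"Energy", "Isaac Newton"},
    "Energy": {"Physics"},
    "Isaac Newton": {"Apple"},
}
print(relevance_levels("Physics", toy_links))
```

In the toy graph only "Energy" is in a bidirectional link relation with the mate "Physics", so it is the only article that receives level 2; "Isaac Newton" is linked from the mate but does not link back and gets no judgment.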

For a more detailed description of the corpus construction process, see the publication (Schamoni et al., 2014).

        #queries    #documents   #relevant documents per query   #words per query
train    225,294     1,226,741                           13.04              25.80
dev       10,000       113,553                           12.97              25.75
test      10,000       115,131                           13.22              25.73

Format

Each of the training, development and test subsets comes in three files:

  1. German queries data (.queries file)
  2. English documents data (.docs file)
  3. relevance judgments (.qrels file)

The format of a query file is:

DE-wiki-page-id [TAB] first sentence (with article title removed)

The format of a document file is:

EN-wiki-page-id [TAB] article (200 words max)

The format of the relevance judgments file is:

DE-wiki-page-id [TAB] EN-wiki-page-id [TAB] relevance-level
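
The three formats above can be read with a few lines of Python. This is a sketch under assumptions: [TAB] is taken to be a literal tab character, files are assumed to be UTF-8 text with one record per line, and the function names and sample IDs are invented for illustration. The readers take any open text handle, so the demo uses in-memory strings instead of guessing file names inside the archive.

```python
import io
from collections import defaultdict


def read_pairs(handle):
    """Yield (page_id, text) tuples from a .queries or .docs file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        # Split on the first tab only, in case the text itself contains tabs.
        page_id, text = line.split("\t", 1)
        yield page_id, text


def read_qrels(handle):
    """Return {query_id: {doc_id: relevance_level}} from a .qrels file."""
    qrels = defaultdict(dict)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, doc_id, level = line.split("\t")
        qrels[query_id][doc_id] = int(level)
    return dict(qrels)


# Made-up sample records, not taken from the corpus.
sample_queries = io.StringIO("12\tist eine Stadt in Deutschland.\n")
sample_qrels = io.StringIO("12\t99\t3\n12\t100\t2\n")

queries = dict(read_pairs(sample_queries))
qrels = read_qrels(sample_qrels)
print(queries)
print(qrels)
```

For the real files, pass a handle opened with `open(path, encoding="utf-8")`. A line that does not match the expected number of tab-separated fields raises `ValueError`, which makes malformed input visible instead of silently skipping it.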

Download

WikiCLIR_v1.tar.gz (v1, 05/08/2015, 861MB, md5: 922b67273cfeb681cc902428181504aa)
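
The MD5 checksum listed above can be used to check the downloaded archive for corruption. The following Python sketch assumes the archive has been saved under its original name in the current directory; the helper names `md5_of_file` and `verify` are illustrative.

```python
import hashlib

# Checksum as listed in the Download section.
EXPECTED_MD5 = "922b67273cfeb681cc902428181504aa"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

Calling `verify("WikiCLIR_v1.tar.gz")` returns `True` only if the local file matches the published checksum. Reading in chunks keeps memory use small for the 861MB archive.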

Publication

  1. Shigehiko Schamoni, Felix Hieber, Artem Sokolov and Stefan Riezler
    Learning Translational and Knowledge-based Similarities from Relevance Rankings for Cross-Language Retrieval
    Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, USA, 2014
    @inproceedings{schamoni2014,
      author = {Schamoni, Shigehiko and Hieber, Felix and Sokolov, Artem and Riezler, Stefan},
      title = {Learning Translational and Knowledge-based Similarities from Relevance Rankings for Cross-Language Retrieval},
      booktitle = {Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2014},
      address = {Baltimore, MD, USA},
      url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/ACL2014short.pdf}
    }