NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval

NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed.

Terms of Use

NFCorpus is free to use for academic purposes. For any other use of the included data, please consult the NutritionFacts.org Terms of Service and contact the site's author, Dr. Michael Greger, directly.

If you use the corpus in your work, please cite: (Boteva et al., 2016).


The corpus contains training, development and testing subsets, split randomly at the query level (80%, 10% and 10%, respectively).
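To illustrate what a random split at the query level means, here is a minimal sketch. The function name `split_queries`, the seed, and the rounding rule are assumptions for illustration only; this is not the script used to produce the released subsets.

```python
import random

def split_queries(query_ids, seed=0):
    """Shuffle query IDs and cut them into 80% / 10% / 10% parts.

    Hypothetical helper: the released split was not necessarily
    produced this way, and the rounding rule here is an assumption.
    """
    ids = sorted(query_ids)              # fixed starting order, so only the seed matters
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10         # integer arithmetic avoids float rounding
    n_dev = len(ids) // 10
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]         # whatever remains goes to the test part
    return train, dev, test

train, dev, test = split_queries([f"q{i}" for i in range(100)])
```

Because the split is done on queries rather than on query-document pairs, all judgments of a given query end up in the same subset.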

Relevance judgments are constructed from the direct and indirect links on the website. The highest relevance level corresponds to a direct link from a NutritionFacts article (query) to a medical article (document) in the cited sources section of a page. The next level is assigned when a query links internally to another NutritionFacts article that in turn links directly to a medical document. Finally, the lowest level is reserved for queries and documents connected through a topic/tag system on the website.

For a more detailed description of the corpus construction process, see the above publication.
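The three-level scheme described above can be sketched roughly as follows. All names here (`Relevance`, `relevance`, the three dictionaries) are hypothetical, the numeric values are placeholders chosen only to preserve the ordering of the levels, and treating tags as a simple shared-label lookup is a simplification.

```python
from enum import IntEnum

class Relevance(IntEnum):
    # Placeholder values: only the ordering DIRECT > INDIRECT > TOPIC is taken
    # from the description; the numbers stored in the .qrel files may differ.
    TOPIC = 1
    INDIRECT = 2
    DIRECT = 3

def relevance(query, doc, cited, internal, tags):
    """Return the highest applicable level for (query, doc), or None.

    cited:    article -> set of medical documents it cites directly
    internal: article -> set of other articles it links to
    tags:     item (article or document) -> set of topic tags
    """
    if doc in cited.get(query, set()):
        return Relevance.DIRECT
    if any(doc in cited.get(other, set()) for other in internal.get(query, set())):
        return Relevance.INDIRECT
    if tags.get(query, set()) & tags.get(doc, set()):
        return Relevance.TOPIC
    return None

# Toy data, invented for illustration only.
cited = {"article-A": {"doc-1"}, "article-B": {"doc-2"}}
internal = {"article-A": {"article-B"}}
tags = {"article-A": {"fiber"}, "doc-3": {"fiber"}, "doc-4": {"statins"}}

levels = {d: relevance("article-A", d, cited, internal, tags)
          for d in ["doc-1", "doc-2", "doc-3", "doc-4"]}
```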


Each of the training, development and testing subsets comes in three kinds of files:

  1. queries data (.queries files, 5 different types) – natural, non-technical language
  2. medical documents data (.docs files) – medical, very technical language
  3. relevance judgments (.qrel files)

The detailed data format is described in the accompanying README in the archive.
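The README remains the authoritative description of the file layout. As a rough illustration only, the sketch below reads relevance judgments under the assumption that they follow the four-column layout that trec_eval reads (query ID, an iteration field, document ID, relevance level). The function name `parse_qrels` and the IDs in the sample are invented.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Parse qrel lines into {query_id: {doc_id: relevance_level}}.

    Assumes whitespace-separated columns: query_id, iteration, doc_id, level.
    Lines that do not have exactly four columns are skipped.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        query_id, _iteration, doc_id, level = parts
        qrels[query_id][doc_id] = int(level)
    return dict(qrels)

# Invented sample lines; real IDs and levels come from the .qrel files.
sample = ["Q-1 0 D-10 2", "Q-1 0 D-11 1", "", "Q-2 0 D-10 3"]
qrels = parse_qrels(sample)
```

With a real file, the same function can be called as `parse_qrels(open(path, encoding="utf-8"))`, where `path` points to one of the .qrel files.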


NFCorpus.tar.gz (v1, crawled 27/07/2015, released 19/02/2016, 29.6MB, md5: 49c061fbadc52ba4d35d0e42e2d742fd)
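After downloading, the archive can be checked against the md5 sum listed above. A minimal sketch using Python's standard `hashlib` follows; the helper names `md5_of` and `verify` are illustrative, and the archive is assumed to be saved locally as `NFCorpus.tar.gz`.

```python
import hashlib

# md5 sum of NFCorpus.tar.gz (v1) as listed on this page.
EXPECTED_MD5 = "49c061fbadc52ba4d35d0e42e2d742fd"

def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected md5 sum."""
    return md5_of(path) == expected
```

For example, `verify("NFCorpus.tar.gz")` should return `True` for an intact download of v1.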


  1. Which flavor of NDCG was used?
    We used the trec_eval script v9.0 from NIST.
  2. How many documents were retrieved in the experiments?
    The top 1,000 documents were retrieved. For other details of the setup, see Sec. 6.1 of our previous paper.
  3. Which part of the dataset was used to obtain the document frequencies (df) for tf-idf and the average document length for BM25?
    The full dataset. Evaluation of these baseline methods was still done on the test subset.
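
Regarding the first question, the following is a minimal sketch of an NDCG computation for readers who want to sanity-check numbers by hand. It assumes gains equal to the relevance level, a log2(rank + 1) discount, and an ideal ranking built from all judged documents of the query, which is how the default `ndcg` measure of trec_eval is commonly understood to work. The function names are illustrative, and trec_eval itself remains the reference for reported results.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, ranks from 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_doc_ids, judged, k=None):
    """NDCG of one ranked list.

    ranked_doc_ids: document IDs in retrieval order
    judged:         {doc_id: relevance_level} for this query
    k:              optional cutoff; None scores the whole list
    Unjudged documents are assumed to contribute zero gain.
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted((g for g in judged.values() if g > 0), reverse=True)
    if k is not None:
        gains, ideal = gains[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Invented judgments for a single query.
judged = {"a": 2, "b": 1}
perfect = ndcg(["a", "b"], judged)
swapped = ndcg(["b", "x", "a"], judged)
```

A ranking that places the documents in order of decreasing relevance level reaches the maximum score, while any other order or a missing relevant document lowers it.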


  1. Vera Boteva, Demian Gholipour, Artem Sokolov and Stefan Riezler
    A Full-Text Learning to Rank Dataset for Medical Information Retrieval
    Proceedings of the 38th European Conference on Information Retrieval (ECIR), Padova, Italy, 2016
      author = {Boteva, Vera and Gholipour, Demian and Sokolov, Artem and Riezler, Stefan},
      title = {A Full-Text Learning to Rank Dataset for Medical Information Retrieval},
      journal = {Proceedings of the 38th European Conference on Information Retrieval},
      journal-abbrev = {ECIR},
      year = {2016},
      city = {Padova},
      country = {Italy},
      url = {}