NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval
NFCorpus is a full-text English retrieval dataset for medical information retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in complex, terminology-heavy language), mostly from PubMed.
If you use the corpus in your work, please cite:
Vera Boteva, Demian Gholipour, Artem Sokolov, Stefan Riezler. "A Full-Text Learning to Rank Dataset for Medical Information Retrieval". In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy.
The corpus contains training, development and testing subsets, randomly split at the query level (80%, 10% and 10%, respectively).
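An 80/10/10 split at the query level can be sketched as follows. This is an illustration only, not the authors' actual splitting procedure; the query-ID pattern and the use of a fixed random seed are assumptions.

```python
import random

def split_queries(query_ids, seed=0):
    """Randomly split query IDs into train/dev/test at 80/10/10."""
    ids = sorted(query_ids)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test

# Hypothetical query IDs for the 3,244 NFCorpus queries:
train, dev, test = split_queries(f"PLAIN-{i}" for i in range(3244))
print(len(train), len(dev), len(test))  # 2595 324 325
```

Splitting at the query level (rather than at the judgment level) keeps all relevance judgments of a query inside one subset, so no query leaks between training and evaluation.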
Relevance judgments are constructed from direct and indirect links on the NutritionFacts.org website. The highest relevance level corresponds to a direct link from a NutritionFacts article (query) to a medical document in the cited sources section of the page. The next level is assigned when a query links internally to another NutritionFacts article that in turn links directly to a medical document. The lowest level is reserved for queries and documents connected only through the topic/tag system on NutritionFacts.org.
For a more detailed description of the corpus construction process, see the above publication.
Format

Each of the training, development and testing subsets comes in three file types:
- NutritionFacts.org queries data (.queries files, 5 different types) -- natural, non-technical language
- medical documents data (.docs files) -- medical, very technical language
- relevance judgments (.qrel files)
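A minimal loader for these files might look like the following. It assumes the .docs and .queries files are tab-separated ID/text pairs and the .qrel files follow the standard TREC qrels layout (query-id, iteration, doc-id, relevance); check the distributed files before relying on this exact format.

```python
import csv

def load_tsv_pairs(path):
    """Load ID<TAB>text lines into a dict (assumed layout of .docs/.queries files)."""
    with open(path, encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}

def load_qrels(path):
    """Load TREC-style qrels: query-id, iteration, doc-id, graded relevance."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels
```

With these helpers, `load_qrels("test.qrel")` yields a nested dict mapping each query ID to its judged documents and their relevance levels, which is convenient for evaluation.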
FAQ

- Which flavor of NDCG was used?
We used the trec_eval script v9.0 from NIST.
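For intuition, the nDCG variant commonly computed by trec_eval uses graded gains with a log2 rank discount, normalized by the DCG of an ideal ranking of the judged documents. The sketch below is a re-implementation for illustration only; the reported numbers come from trec_eval itself, and edge-case handling may differ.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a 1/log2(rank+1) discount (rank starts at 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_doc_ids, qrels_for_query, k=None):
    """nDCG@k with graded gains, normalized by the ideal DCG over judged docs."""
    gains = [qrels_for_query.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if ideal else 0.0

# Toy example with two judged documents (relevance 2 and 1) and one unjudged:
qrels = {"d1": 2, "d2": 1}
print(round(ndcg(["d2", "d1", "d3"], qrels), 4))  # 0.8597
```

Unjudged documents receive a gain of 0 here, matching the usual treatment of missing judgments in qrels-based evaluation.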
- How many documents were retrieved in the experiments?
The top 1000 documents were retrieved. For other details of the setup, see Sec. 6.1 of our previous paper.
- Which part of the dataset was used to obtain the df's for tf-idf and the average document length for BM25?
The full dataset. Evaluation of these baseline methods was still done on the test subset.
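Collecting these statistics over the full document set can be sketched as below. The naive lowercase whitespace tokenization is an assumption for illustration, not the paper's preprocessing pipeline.

```python
from collections import Counter

def collection_stats(docs):
    """docs: dict mapping doc-id -> text.
    Returns (document frequencies, average document length in tokens)."""
    df = Counter()
    total_len = 0
    for text in docs.values():
        tokens = text.lower().split()  # assumed tokenization, not the paper's
        total_len += len(tokens)
        df.update(set(tokens))         # count each term once per document
    avgdl = total_len / len(docs) if docs else 0.0
    return df, avgdl

# Toy corpus with hypothetical doc IDs:
docs = {"MED-1": "heart disease risk", "MED-2": "dietary risk factors"}
df, avgdl = collection_stats(docs)
print(df["risk"], avgdl)  # 2 3.0
```

Estimating df and avgdl on the full collection while still evaluating rankings only on the test queries is the standard setup for unsupervised baselines like tf-idf and BM25, since these statistics describe the collection rather than any labeled split.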