BoostCLIR: JP-EN Relevance Marked Patent Corpus

BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts extracted from the MAREC patent data and from the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search.

Important: The English side of the corpus contains patent IDs as well as the text of the abstracts. The Japanese side contains only patent IDs because of NTCIR copyright restrictions. The Japanese patent abstracts can be extracted from the full-text Japanese patent documents, which are available from the organizers of the NTCIR workshop.

Terms of Use

BoostCLIR is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

If you use the corpus in your work, please cite (Sokolov et al., 2013).

Data

The corpus contains training, development and testing subsets sampled from non-intersecting time periods.

Relevance judgements for patent retrieval are constructed from patent citations by assigning three integer levels to three categories of relationships: the highest relevance level (3) for family patents, a lower level (2) for patents cited in search reports by patent examiners, and the lowest level (1) for applicants’ citations.

For a detailed description of the corpus construction process, please see the above publication.
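
The three relevance levels described above can be sketched as a small lookup. This is only an illustration: the category labels (`family`, `examiner`, `applicant`), the function name, and the rule that the highest level wins when a pair falls into more than one category are assumptions, not part of the corpus specification.

```python
# Hypothetical category labels for the three citation relationships.
RELEVANCE_LEVELS = {
    "family": 3,     # patents belonging to the same patent family
    "examiner": 2,   # cited in a search report by a patent examiner
    "applicant": 1,  # cited by the applicant
}


def relevance_level(categories):
    """Return the relevance level for one query-document pair.

    `categories` is an iterable of category labels that apply to the pair.
    Taking the maximum when several apply is an assumption; 0 means that
    no citation relationship is known.
    """
    return max((RELEVANCE_LEVELS[c] for c in categories), default=0)
```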

	#queries	#relevant docs	#unique docs
train	107,061	1,422,253	888,127
dev	2,000	26,478	25,669
test	2,000	25,173	24,668

Table 1. Statistics of ranking data
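
As a quick reading aid for Table 1, the average number of relevant documents per query can be derived from the first two columns. The sketch below assumes that "#relevant docs" counts query-document judgements (so that one document may be counted for several queries); the variable and function names are illustrative.

```python
# Figures copied from Table 1: (#queries, #relevant docs, #unique docs).
STATS = {
    "train": (107_061, 1_422_253, 888_127),
    "dev": (2_000, 26_478, 25_669),
    "test": (2_000, 25_173, 24_668),
}


def avg_relevant_per_query(n_queries, n_relevant):
    """Average number of relevance judgements per query."""
    return n_relevant / n_queries


for subset, (n_queries, n_relevant, _n_unique) in STATS.items():
    print(f"{subset}: {avg_relevant_per_query(n_queries, n_relevant):.2f}")
```

Under that assumption, each subset has roughly 13 relevant documents per query.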

Format

Each of the training, development and test subsets comes in three files:
1. Japanese queries data (.queries file)
2. English documents data (.docs file)
3. relevance judgements (.qrels file)

The format of the queries and documents files is:
```
patent-id [TAB] abstract-text
```
(The abstract text is included for English only; the Japanese files contain only patent IDs.)
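
A minimal sketch of reading a file in this format is shown below. The function name and the patent IDs used for testing are hypothetical. Lines that carry only a patent ID, as on the Japanese side, are mapped to an empty string.

```python
def parse_abstracts(lines):
    """Map patent-id -> abstract-text from tab-separated lines.

    A line without a tab (ID only) yields an empty abstract.
    """
    abstracts = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        patent_id, _, text = line.partition("\t")
        abstracts[patent_id] = text
    return abstracts
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `parse_abstracts(open(path, encoding="utf-8"))`. The UTF-8 encoding is an assumption.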

The format of the relevance judgements file is:

jp-patent-id [TAB] en-patent-id [TAB] relevance-level
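
The relevance judgements can be loaded into a nested dictionary along the following lines. This is a sketch rather than a reference implementation; the function name and the IDs used for testing are hypothetical.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map jp-patent-id -> {en-patent-id: relevance-level}."""
    qrels = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        jp_id, en_id, level = line.split("\t")
        qrels[jp_id][en_id] = int(level)
    return dict(qrels)
```

Keying by the Japanese query ID first makes it easy to retrieve all judged English documents for one query.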

Download

Parallel data: boostclir.tar.gz (242MB, md5: ba7a6af00a68288e8b4a570c87537e85)
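
After downloading, the archive can be checked against the MD5 sum given above. The sketch below uses Python's standard `hashlib`; the function name and the chunk size are illustrative choices.

```python
import hashlib

# Checksum as published on this page.
EXPECTED_MD5 = "ba7a6af00a68288e8b4a570c87537e85"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved as `boostclir.tar.gz` in the current directory, `md5_of_file("boostclir.tar.gz") == EXPECTED_MD5` should hold for an intact download.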

You can download the MAREC data set, which contains the source documents, from TU Wien and order the NTCIR collections from the organizers of the NTCIR PatentMT task.

Publication

Artem Sokolov, Laura Jehl, Felix Hieber and Stefan Riezler

Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, WA, USA, 2013

@inproceedings{sokolov2013b,
  author = {Sokolov, Artem and Jehl, Laura and Hieber, Felix and Riezler, Stefan},
  title = {Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
  journal-abbrev = {EMNLP},
  year = {2013},
  address = {Seattle, WA, USA},
  url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/EMNLP13.pdf}
}