Ruprecht-Karls-Universität Heidelberg

PatTR: Patent Translation Resource

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.

Terms of use

PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Please cite Wäschle and Riezler (2012b) if you use the corpus in your work, or use the data citation specified in the HeiDATA entry.


Data

The corpus is organized by language pair and by the text sections of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.

Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007).
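The family-based document alignment can be pictured with a minimal sketch. This is not the pipeline used to build PatTR; it only illustrates the idea of pairing documents that share a patent family id. The record layout (dicts with "patent_id" and "family_id" keys) and all ids below are hypothetical.

```python
from collections import defaultdict


def pair_by_family(epo_docs, uspto_docs):
    """Pair every EPO document with the USPTO documents of the same family."""
    # Index the English USPTO documents by their family id.
    uspto_by_family = defaultdict(list)
    for doc in uspto_docs:
        uspto_by_family[doc["family_id"]].append(doc)

    # Emit one (EPO id, USPTO id) pair per shared family.
    pairs = []
    for doc in epo_docs:
        for en_doc in uspto_by_family.get(doc["family_id"], []):
            pairs.append((doc["patent_id"], en_doc["patent_id"]))
    return pairs


# Hypothetical example records, not real patent ids.
epo = [
    {"patent_id": "EP-A", "family_id": "F1"},
    {"patent_id": "EP-B", "family_id": "F2"},
    {"patent_id": "EP-C", "family_id": "F9"},  # no USPTO counterpart
]
uspto = [
    {"patent_id": "US-X", "family_id": "F1"},
    {"patent_id": "US-Y", "family_id": "F2"},
]
pairs = pair_by_family(epo, uspto)
```

Documents paired this way are only candidates for parallel text; the sentences inside them still have to be aligned in a separate step.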

All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.

For a detailed description of the corpus construction process, please see the publications.

Metadata

In addition to the bitext we provide patent metadata for each sentence:

  • The patent id of the original document, which can be a patent application or a granted patent.
  • The patent family id that groups related documents with mostly overlapping content, e.g. patents for the same invention in different jurisdictions.
  • Publication date.
  • Classification according to the IPC down to subclass level.

Further metadata, e.g. inventor or company, can be found in the original patent indicated by the document id.

For description data, where the bitext has been collected from two separate documents, metadata is given for both original patents.
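As an illustration, the per-sentence metadata could be held in a small record type like the one below. The field names, the date format and the example values are assumptions, since the file layout is not specified here. The helper relies only on the general structure of an IPC subclass symbol: a section letter (A to H), a two-digit class and a subclass letter.

```python
import re
from dataclasses import dataclass

# IPC subclass symbol: section letter, two-digit class, subclass letter.
IPC_SUBCLASS = re.compile(r"([A-H])(\d{2})([A-Z])")


@dataclass(frozen=True)
class SentenceMeta:
    """Hypothetical container for the metadata of one sentence pair."""
    patent_id: str
    family_id: str
    publication_date: str  # assumed to be an ISO date string
    ipc_subclass: str      # e.g. "A61K"


def split_ipc_subclass(code):
    """Return the (section, class, subclass) symbols for a subclass code."""
    match = IPC_SUBCLASS.fullmatch(code)
    if match is None:
        raise ValueError(f"not an IPC subclass symbol: {code!r}")
    section, class_digits, _ = match.groups()
    return section, section + class_digits, code


# Hypothetical example values.
meta = SentenceMeta("EP-A", "F1", "2005-03-01", "A61K")
levels = split_ipc_subclass(meta.ipc_subclass)
```

For description data, two such records would be needed per sentence pair, one for each original patent.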

Download

Parallel data:

You can download the MAREC data set, which contains the source documents, from TU Wien.

PatTR is also available from HeiDATA.

Training and test sets for several tasks are available:

  • Multi-task learning on text genres and IPC section (Wäschle and Riezler, 2012a), eacl12.tar.gz (1.7GB, md5)
  • Multi-task learning on IPC sections (Simianer and Riezler, 2013), wmt13.tar.gz (204MB, md5)
  • Online learning for computer-assisted translation (Wäschle et al., 2013), mtsummit13.tar.gz (596MB, md5)
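Since each archive comes with an MD5 checksum, a download can be verified before unpacking. The following is a minimal sketch using Python's standard hashlib module. The function names are illustrative, and the expected digest is assumed to have been copied by hand from the published md5 file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading avoids loading multi-gigabyte archives into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as matches_checksum("wmt13.tar.gz", expected) returns True only if the local file matches the published digest.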

Splitting the data

To create custom training and test sets, an easy option is to split the corpus by document publication date. Note that abstract and claims data contain a small amount (less than 1%) of duplicate and near-duplicate sentences, caused by multiple instances of the same patent document in the two corpora. To prevent overlap, make sure that the family ids of the test and training sets are disjoint.

Furthermore, about 7% of the description data are duplicates. This is caused by the patent writing process, where whole paragraphs are copied verbatim from other documents, e.g. when parts of an invention are similar to a previously filed one. These documents do not share a patent id, so they cannot be easily identified. Indicators are mutual citations and documents filed by the same company. We did not remove these duplicates because they are a feature of patent corpora.

Since patent titles are very short and general, 15% of title data are natural duplicates.
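A date-based split with disjoint family ids might look like the sketch below. The record layout (dicts with "family_id" and "pub_date" keys, dates as ISO strings so that string comparison follows date order) and the cutoff value are assumptions for illustration. The sketch does not address the copied paragraphs in description data, which do not share an id.

```python
def split_by_date(records, cutoff):
    """Split records at a publication date, keeping family ids disjoint.

    Records published on or after the cutoff form the test set. Earlier
    records whose family also occurs in the test set are dropped from
    the training set.
    """
    records = list(records)
    test = [r for r in records if r["pub_date"] >= cutoff]
    test_families = {r["family_id"] for r in test}
    train = [
        r for r in records
        if r["pub_date"] < cutoff and r["family_id"] not in test_families
    ]
    return train, test


# Hypothetical example records.
records = [
    {"family_id": "F1", "pub_date": "2005-03-01"},
    {"family_id": "F1", "pub_date": "2009-06-15"},
    {"family_id": "F2", "pub_date": "2006-01-10"},
    {"family_id": "F3", "pub_date": "2010-02-20"},
]
train, test = split_by_date(records, "2009-01-01")
```

Dropping the overlapping families from the training side keeps the test set intact, at the cost of a slightly smaller training set.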

Statistics


Section       Sentences      en tokens    de tokens  Bitext size
title         2,101,107     16,457,527   13,212,645        248MB
abstract        720,571     30,942,571   26,803,868        383MB
claims        8,346,863    501,373,533  435,117,827        6.1GB
description  11,829,816    498,948,414  386,920,744        4.9GB
total        22,998,357  1,047,722,045  862,055,084       11.5GB

Section       Sentences    en tokens    fr tokens  Bitext size
title         2,504,772   19,458,540   23,605,412        307MB
abstract      3,697,670  130,801,982  144,591,792       1.73GB
claims        6,966,851  422,504,392  468,029,948        5.3GB
description   5,594,745  200,043,688  204,449,266        2.5GB
total        18,764,038  772,808,602  840,676,418       9.84GB

Section   Sentences    fr tokens    de tokens  Bitext size
title     1,953,815   18,337,771   12,229,339        252MB
abstract    122,440    5,816,764    4,594,012         74MB
claims    3,034,007  206,982,238  162,760,901        2.5GB
total     5,110,262  231,136,773  179,584,252       2.83GB

The numbers for de-en differ slightly from those reported in Wäschle and Riezler (2012b) due to some additional processing steps that were performed before the release.
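The figures in these tables can be used directly for simple sanity checks. The sketch below copies the de-en counts from the first table, recomputes the totals and derives the average number of English tokens per sentence for each section. The variable names are illustrative.

```python
# Section: (sentences, en tokens, de tokens), copied from the de-en table.
DE_EN = {
    "title": (2_101_107, 16_457_527, 13_212_645),
    "abstract": (720_571, 30_942_571, 26_803_868),
    "claims": (8_346_863, 501_373_533, 435_117_827),
    "description": (11_829_816, 498_948_414, 386_920_744),
}

# Column sums, to be compared against the "total" row.
total_sentences = sum(row[0] for row in DE_EN.values())
total_en_tokens = sum(row[1] for row in DE_EN.values())
total_de_tokens = sum(row[2] for row in DE_EN.values())

# Average English sentence length per section.
avg_en_length = {
    section: en_tokens / sentences
    for section, (sentences, en_tokens, _) in DE_EN.items()
}

for section, length in avg_en_length.items():
    print(f"{section}: {length:.1f} en tokens per sentence")
```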

Acknowledgments

The work was in part supported by the "Cross-language Learning-to-Rank for Patent Retrieval" project funded by the Deutsche Forschungsgemeinschaft (DFG).

Publications

Wäschle, K., Simianer, P., Bertoldi, N., Riezler, S., and Federico, M. (2013). Generative and Discriminative Methods for Online Adaptation in SMT. Proceedings of Machine Translation Summit XIV, Nice, France.

Simianer, P. and Riezler, S. (2013). Multi-Task Learning for Improved Discriminative Training in SMT. Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria.

Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France.

Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27.
