PatTR: Patent Translation Resource
PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.
Terms of use
PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. If you use the corpus in your work, please cite Wäschle and Riezler (2012b) or use the data citation specified in the HeiDATA entry.

Data
The corpus is organized by language pair and by the text sections of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.
Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007).
All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.
For a detailed description of the corpus construction process, please see the publications.
Metadata
In addition to the bitext we provide patent metadata for each sentence:
- The patent id of the original document, which can be a patent application or a granted patent.
- The patent family id that groups related documents with mostly overlapping content, e.g. patents for the same invention in different jurisdictions.
- Publication date.
- Classification according to the IPC down to subclass level.
Further metadata, e.g. inventor or company, can be found in the original patent indicated by the document id.
For description data, where the bitext has been collected from two separate documents, metadata is given for both original patents.
Download
Parallel data:
- de-en.tar.gz (2.8GB, md5)
- en-fr.tar.gz (2.4GB, md5)
- fr-de.tar.gz (646MB, md5)
You can download the MAREC data set, which contains the source documents, from TU Wien.
PatTR is also available from HeiDATA.
Training and test sets for several tasks are available:
- Multi-task learning on text genres and IPC section (Wäschle and Riezler, 2012a), eacl12.tar.gz (1.7GB, md5)
- Multi-task learning on IPC sections (Simianer and Riezler, 2013), wmt13.tar.gz (204MB, md5)
- Online learning for computer-assisted translation (Wäschle et al., 2013), mtsummit13.tar.gz (596MB, md5)
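Since each archive is listed with an md5 checksum, it may be worth verifying a download before unpacking it. The following is a minimal sketch in Python using only the standard library. The function names are illustrative, and the assumption that the checksum file holds the hex digest as its first whitespace-separated token (the common `md5sum` layout) has not been checked against the actual files.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, checksum_text):
    """Compare a file against the contents of a checksum file.

    Assumption: the first whitespace-separated token of checksum_text
    is the expected hex digest, as in the usual md5sum output format.
    """
    expected = checksum_text.split()[0].lower()
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("de-en.tar.gz", open("de-en.tar.gz.md5").read())` would then return True for an intact download; the name of the checksum file here is a placeholder.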
Splitting the data
To create custom training and test sets, an easy option is to split the corpus by document publication date. Note that abstract and claims data contain a small amount (less than 1%) of duplicate and near-duplicate sentences due to multiple instances of the same patent document in the two corpora. To prevent overlap, make sure the family ids of the test and training sets are disjoint. Furthermore, about 7% of the description data are duplicates. This is caused by the patent writing process, where whole paragraphs are copied verbatim from other documents, e.g. when parts of an invention are similar to a previously filed one. These documents do not share a patent id, so they cannot be easily identified. Indicators are mutual citations and documents filed by the same company. We did not remove these duplicates because they are a feature of patent corpora. Since patent titles are very short and general, 15% of title data are natural duplicates.
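The date-based split with disjoint family ids can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the released training and test sets: the `SentencePair` record, its field names and the `split_by_date` function are hypothetical, and the on-disk layout of the bitext and metadata files is not described here, so loading is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SentencePair:
    # Hypothetical in-memory record; field names are assumptions.
    source: str
    target: str
    patent_id: str
    family_id: str
    publication_date: date


def split_by_date(pairs, cutoff):
    """Split sentence pairs into (train, test) by publication date.

    Pairs published on or after `cutoff` go to the test set. Training
    pairs whose patent family also occurs in the test set are dropped,
    so the family ids of the two sets are disjoint.
    """
    test = [p for p in pairs if p.publication_date >= cutoff]
    test_families = {p.family_id for p in test}
    train = [
        p for p in pairs
        if p.publication_date < cutoff and p.family_id not in test_families
    ]
    return train, test
```

Dropping the overlapping families from the training side keeps the test set untouched, which is one possible choice; removing them from the test side instead would work equally well. Note that this only addresses duplicates that share a family id, not the verbatim copies in description data that come from unrelated documents.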
Statistics
Section | Sentences | en tokens | de tokens | Bitext size |
title | 2,101,107 | 16,457,527 | 13,212,645 | 248MB |
abstract | 720,571 | 30,942,571 | 26,803,868 | 383MB |
claims | 8,346,863 | 501,373,533 | 435,117,827 | 6.1GB |
description | 11,829,816 | 498,948,414 | 386,920,744 | 4.9GB |
total | 22,998,357 | 1,047,722,045 | 862,055,084 | 11.5GB |
Section | Sentences | en tokens | fr tokens | Bitext size |
title | 2,504,772 | 19,458,540 | 23,605,412 | 307MB |
abstract | 3,697,670 | 130,801,982 | 144,591,792 | 1.73GB |
claims | 6,966,851 | 422,504,392 | 468,029,948 | 5.3GB |
description | 5,594,745 | 200,043,688 | 204,449,266 | 2.5GB |
total | 18,764,038 | 772,808,602 | 840,676,418 | 9.84GB |
Section | Sentences | fr tokens | de tokens | Bitext size |
title | 1,953,815 | 18,337,771 | 12,229,339 | 252MB |
abstract | 122,440 | 5,816,764 | 4,594,012 | 74MB |
claims | 3,034,007 | 206,982,238 | 162,760,901 | 2.5GB |
total | 5,110,262 | 231,136,773 | 179,584,252 | 2.83GB |
The numbers for de-en differ slightly from those reported in Wäschle and Riezler (2012b) due to some additional processing steps that were performed before the release.
Acknowledgments
The work was in part supported by the "Cross-language Learning-to-Rank for Patent Retrieval" project funded by the Deutsche Forschungsgemeinschaft (DFG).
Publications
Wäschle, K., Simianer, P., Bertoldi, N., Riezler, S., and Federico, M. (2013). Generative and Discriminative Methods for Online Adaptation in SMT. Proceedings of Machine Translation Summit XIV, Nice, France.
Simianer, P. and Riezler, S. (2013). Multi-Task Learning for Improved Discriminative Training in SMT. Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria.
Wäschle, K. and Riezler, S. (2012a). Structural and Topical Dimensions in Multi-Task Patent Translation. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), Avignon, France.
Wäschle, K. and Riezler, S. (2012b). Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus. Multidisciplinary Information Retrieval, pp. 12-27.