PatTR: Patent Translation Resource

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.

Terms of Use

PatTR is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. If you use the corpus in your work, please cite Wäschle & Riezler (2012) or use the data citation specified in the HeiDATA entry.

Data

The corpus is sorted by language pair and by the text sections of a patent document, namely title, abstract, claims and description. Parallel data from the title, abstract and claims sections were extracted from documents belonging to the European Patent Office (EPO) and the World Intellectual Property Organization (WIPO) corpora in MAREC. Both resources feature multilingual documents that contain, for example, both an English and a German abstract.

Since there are no multilingual descriptions, data from this section were collected by exploiting patent families to align German and French documents from the EPO corpus to English documents from the United States Patent and Trademark Office (USPTO) corpus, following Utiyama and Isahara (2007).

All sections were sentence-aligned using the Gargantua aligner. Preprocessing was done automatically. Sentence boundaries were detected using the Europarl processing tools.

For a detailed description of the corpus construction process, please see the publications.

Metadata

In addition to the bitext, we provide patent metadata for each sentence:

The patent id of the original document, which can be a patent application or a granted patent.
The patent family id that groups related documents with mostly overlapping content, e.g. patents for the same invention in different jurisdictions.
Publication date.
Classification according to the IPC down to subclass level.

Further metadata, e.g. inventor or company, can be found in the original patent indicated by the document id.

For description data, where the bitext has been collected from two separate documents, metadata is given for both original patents.
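
Because the IPC classification is given down to subclass level, a code can be broken into its section, class and subclass parts, for example to group sentences by IPC section. The following is a minimal sketch only: it assumes the code is available as a string that starts with a standard IPC subclass symbol such as `A61K`. How the code is actually written in the metadata files is not specified here, and the names `IpcSubclass` and `parse_ipc` are hypothetical.

```python
import re
from typing import NamedTuple


class IpcSubclass(NamedTuple):
    section: str    # one letter A-H, e.g. "A"
    ipc_class: str  # section plus two digits, e.g. "A61"
    subclass: str   # class plus one letter, e.g. "A61K"


# Section letter, two-digit class number, subclass letter.
# Anything after the subclass (group numbers etc.) is ignored.
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])")


def parse_ipc(code: str) -> IpcSubclass:
    """Split an IPC symbol into section, class and subclass.

    Assumption: `code` begins with a subclass-level IPC symbol.
    """
    match = _IPC_RE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not an IPC subclass symbol: {code!r}")
    section, digits, letter = match.groups()
    return IpcSubclass(section, section + digits, section + digits + letter)


if __name__ == "__main__":
    print(parse_ipc("A61K"))
```

Grouping by `section` gives the coarsest split (eight sections, A to H), while `subclass` is the finest level the metadata provides.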

Download

Parallel data:

de-en.tar.gz (2.8GB, md5: 55a074640806d29c9dcfcdb9346e6ce7)
en-fr.tar.gz (2.4GB, md5: 68cb277faf451b0206eb85c559d29c46)
fr-de.tar.gz (646MB, md5: 120484093f5f930fe8646eb3b3be76e3)
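
The md5 sums listed above can be used to check that an archive was downloaded completely. The sketch below is one way to do this with the Python standard library; it assumes the archives keep their original file names, and the function names `md5_of_file` and `verify` are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the list of parallel data archives above.
EXPECTED_MD5 = {
    "de-en.tar.gz": "55a074640806d29c9dcfcdb9346e6ce7",
    "en-fr.tar.gz": "68cb277faf451b0206eb85c559d29c46",
    "fr-de.tar.gz": "120484093f5f930fe8646eb3b3be76e3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's md5 matches the listed checksum.

    Raises KeyError if no checksum is listed for the file name.
    """
    path = Path(path)
    expected = EXPECTED_MD5[path.name]
    return md5_of_file(path) == expected
```

For example, `verify("de-en.tar.gz")` returns `False` for a truncated or corrupted download, in which case the archive should be fetched again.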

You can download the MAREC data set, which contains the source documents, from TU Wien.

PatTR is also available from HeiDATA.

Training and test sets for several tasks are available:

Multi-task learning on text genres and IPC section (Wäschle & Riezler, 2012)
eacl12.tar.gz (1.7GB, md5: f7afcd4cb5189cd8bfc33c95af556215)
Multi-task learning on IPC sections (Simianer & Riezler, 2013)
wmt13.tar.gz (204MB, md5: 20a51980a77af40df30e283a3c33b77e)
Online learning for computer-assisted translation (Wäschle et al., 2013)
mtsummit13.tar.gz (596MB, md5: c4b22fcec89aa9e13e26aa5f1db767f9)

Splitting the data

For creating custom training and test sets, an easy option is to split the corpus by document publication date. Note that abstract and claims data contain a small amount (less than 1%) of duplicate and near-duplicate sentences due to multiple instances of the same patent document in the two corpora. To prevent overlap, make sure the family ids of the test and training sets are disjoint.

Furthermore, about 7% of the description data are duplicates. This is caused by the patent writing process, in which whole paragraphs are copied verbatim from other documents, e.g. when parts of an invention are similar to a previously filed one. These documents do not share a patent id, so they cannot be easily identified. Indicators are mutual citations and documents filed by the same company. We did not remove these duplicates because they are a feature of patent corpora.

Since patent titles are very short and general, 15% of title data are natural duplicates.
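
A date-based split with disjoint family ids could be sketched as follows. This is only an illustration under stated assumptions: each sentence pair is represented as a dict with the hypothetical keys `family_id` and `pub_date` (a `datetime.date`), which is not the on-disk format of the corpus, and each pair is assumed to carry a single family id. Pairs published on or after the cutoff go to the test set, and any training pair whose family also occurs in the test set is dropped.

```python
from datetime import date


def split_by_date(records, cutoff):
    """Split records into (train, test) by publication date.

    records: iterable of dicts with the hypothetical keys
             'family_id' and 'pub_date' (datetime.date).
    cutoff:  records published on or after this date form the test set.

    Training records that share a family id with any test record
    are removed, so the two sets have disjoint family ids.
    """
    records = list(records)
    test = [r for r in records if r["pub_date"] >= cutoff]
    test_families = {r["family_id"] for r in test}
    train = [
        r for r in records
        if r["pub_date"] < cutoff and r["family_id"] not in test_families
    ]
    return train, test


if __name__ == "__main__":
    example = [
        {"family_id": "f1", "pub_date": date(2005, 1, 1)},
        {"family_id": "f1", "pub_date": date(2009, 1, 1)},
        {"family_id": "f2", "pub_date": date(2004, 1, 1)},
    ]
    train, test = split_by_date(example, date(2008, 1, 1))
    print(len(train), len(test))
```

Dropping the overlapping families from the training side keeps the test set unchanged. Note that this filter does not catch the verbatim copies between unrelated documents described above, since those are not linked by a shared id.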

Statistics

Section	Sentences	en tokens	de tokens	Bitext size
title	2,101,107	16,457,527	13,212,645	248MB
abstract	720,571	30,942,571	26,803,868	383MB
claims	8,346,863	501,373,533	435,117,827	6.1GB
description	11,829,816	498,948,414	386,920,744	4.9GB
total	22,998,357	1,047,722,045	862,055,084	11.5GB

Section	Sentences	en tokens	fr tokens	Bitext size
title	2,504,772	19,458,540	23,605,412	307MB
abstract	3,697,670	130,801,982	144,591,792	1.73GB
claims	6,966,851	422,504,392	468,029,948	5.3GB
description	5,594,745	200,043,688	204,449,266	2.5GB
total	18,764,038	772,808,602	840,676,418	9.84GB

Section	Sentences	fr tokens	de tokens	Bitext size
title	1,953,815	18,337,771	12,229,339	252MB
abstract	122,440	5,816,764	4,594,012	74MB
claims	3,034,007	206,982,238	162,760,901	2.5GB
total	5,110,262	231,136,773	179,584,252	2.83GB

The numbers for de-en differ slightly from those reported in (Wäschle & Riezler, 2012) due to some additional processing steps that were performed before the release.

Acknowledgments

This work was supported in part by the “Cross-language Learning-to-Rank for Patent Retrieval” project funded by the Deutsche Forschungsgemeinschaft (DFG).

Publications

Katharina Wäschle and Stefan Riezler

Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus

Proceedings of the 5th Information Retrieval Facility Conference (IRFC), Vienna, Austria, 2012

@inproceedings{waeschle2012a,
  author = {W\"{a}schle, Katharina and Riezler, Stefan},
  title = {Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus},
  booktitle = {Proceedings of the 5th Information Retrieval Facility Conference},
  journal-abbrev = {IRFC},
  year = {2012},
  address = {Vienna, Austria},
  url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/IRF2012.pdf}
}

Katharina Wäschle and Stefan Riezler

Structural and Topical Dimensions in Multi-Task Patent Translation

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Avignon, France, 2012

@inproceedings{waeschle2012b,
  author = {W\"{a}schle, Katharina and Riezler, Stefan},
  title = {Structural and Topical Dimensions in Multi-Task Patent Translation},
  booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics},
  journal-abbrev = {EACL},
  year = {2012},
  address = {Avignon, France},
  url = {http://www.cl.uni-heidelberg.de/~riezler/publications/papers/EACL2012.pdf}
}

Patrick Simianer and Stefan Riezler

Multi-Task Learning for Improved Discriminative Training in SMT

Proceedings of the Workshop on Statistical Machine Translation (WMT), Sofia, Bulgaria, 2013

@inproceedings{simianer2013b,
  author = {Simianer, Patrick and Riezler, Stefan},
  title = {Multi-Task Learning for Improved Discriminative Training in SMT},
  booktitle = {Proceedings of the Workshop on Statistical Machine Translation},
  journal-abbrev = {WMT},
  year = {2013},
  address = {Sofia, Bulgaria},
  url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/WMT2013.pdf}
}

Katharina Wäschle, Patrick Simianer, Nicola Bertoldi, Stefan Riezler and Marcello Federico

Generative and Discriminative Methods for Online Adaptation in SMT

Proceedings of MT SUMMIT XIV, Nice, France, 2013

@inproceedings{waeschle2013,
  author = {W\"{a}schle, Katharina and Simianer, Patrick and Bertoldi, Nicola and Riezler, Stefan and Federico, Marcello},
  title = {Generative and Discriminative Methods for Online Adaptation in SMT},
  booktitle = {Proceedings of MT SUMMIT XIV},
  year = {2013},
  address = {Nice, France},
  url = {https://www.cl.uni-heidelberg.de/~riezler/publications/papers/MTSUMMIT13.pdf}
}