LibriVoxDeEn: A corpus for German-to-English Speech Translation and Speech Recognition

This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the audio book setup. The quality of the audio and sentence alignments has been checked by a manual evaluation, which showed that the speech alignment quality is in general very high. The sentence alignment quality is comparable to widely used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for end-to-end speech translation for German.

Terms of Use

LibriVoxDeEn is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. If you use the corpus in your work, please cite Beilharz, Sun, Karimova, & Riezler (2019).

Data

                  #books  #chapters  #sentences              #hours  #wsource   #unique wsource  #wtarget  #unique wtarget
DE                86      1,556      419,449                 547     4,082,479  264,049          -         -
DE-EN (aligned)   19      365        [DE]53,168, [EN]50,883  133     898,676    93,308           989,768   65,855
DE-EN (filtered)  19      365        [DE]50,427, [EN]50,883  110     860,369    50,072           948,565   40,811

Folder structure

  • de
    • contains German text files for each book
  • en
    • contains English text files for each book
  • audio
    • alignment maps by aeneas
    • segmented audio files
  • tables
    • text2speech.tsv -> lookup table for German text-to-speech alignments
    • text2text.tsv -> lookup table for German-English text-to-text alignments
  • extract.py: tool to extract and sample data from the dataset

Get the dataset:

Our corpus is available on HeiDATA (Version 1.0 published).

Example usage

Requirements

  • Python 3.5 or later
  • pandas

Parameters & Arguments

  • --all: retrieve the complete lookup table
  • -i [int] Book number: retrieve the book whose starting index is arg
  • -t [float] Threshold: only keep entries with a score higher than arg
  • -b [int] Batch size: retrieve a batch of arg entries

Retrieving a specific book: python3 extract.py -i 11 [-t 0.5]

Retrieving a batch of entries: python3 extract.py -b 5000 -t 0.5

Retrieve all data: python3 extract.py --all
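
The three invocations above can also be assembled from Python, for example when sampling several batches in a script. The following is a minimal sketch, not part of the corpus tooling: the helper names build_extract_command and run_extract are hypothetical, and it assumes extract.py sits in the current working directory and accepts exactly the flags listed under Parameters & Arguments.

```python
import subprocess
import sys


def build_extract_command(book=None, threshold=None, batch_size=None, all_data=False):
    """Assemble an argv list for extract.py using only the documented flags.

    Hypothetical helper: it does not validate flag combinations, because the
    README does not state which combinations extract.py accepts.
    """
    cmd = [sys.executable, "extract.py"]
    if all_data:
        cmd.append("--all")
    if book is not None:
        cmd += ["-i", str(book)]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    return cmd


def run_extract(**options):
    """Run extract.py in the current directory; raises if it exits non-zero."""
    subprocess.run(build_extract_command(**options), check=True)


# Usage (assumes extract.py and the lookup tables are present):
#   run_extract(book=11, threshold=0.5)
#   run_extract(batch_size=5000, threshold=0.5)
#   run_extract(all_data=True)
```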

Each of these commands writes a data.tsv file containing the sorted, sampled entries; with -b, the number of rows is given by the batch size. For example:

book audio score de_sentence en_sentence #w_de #w_en
18.undine 00001-undine10.wav 0.63 Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, als sehe er durch das Laubgegitter den nickenden Mann hervorkommen. Indeed, when he raised his eyes toward the wood it seemed to him as if he actually saw the nodding man approaching through the dense foliage. 25 26

Column explanation

  • book: the book file the sentence comes from (e.g. 18.undine)
  • audio: the audio file for the German sentence
  • score: confidence score given by hunalign
  • de_sentence: source sentence in German
  • en_sentence: corresponding English sentence
  • #w_de: number of words on source side
  • #w_en: number of words on target side
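
As an illustration of how these columns can be consumed, here is a minimal sketch that reads a data.tsv-style file and keeps only entries above a chosen hunalign score cutoff. It assumes that data.tsv is tab-separated and starts with a header row carrying the column names listed above; neither detail is confirmed here, so adjust the reader if your file differs. The function name load_entries and the second sample row are made up for the example. The standard csv module is used to keep the sketch dependency-free; pandas.read_csv with sep="\t" would serve equally well.

```python
import csv
import io

# Stand-in for data.tsv. The first data row is the example from this README;
# the second is a made-up placeholder row with a low alignment score.
SAMPLE_TSV = "\n".join([
    "\t".join(["book", "audio", "score", "de_sentence", "en_sentence", "#w_de", "#w_en"]),
    "\t".join([
        "18.undine", "00001-undine10.wav", "0.63",
        "Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, "
        "als sehe er durch das Laubgegitter den nickenden Mann hervorkommen.",
        "Indeed, when he raised his eyes toward the wood it seemed to him as if he "
        "actually saw the nodding man approaching through the dense foliage.",
        "25", "26",
    ]),
    "\t".join(["18.undine", "placeholder.wav", "0.10", "Platzhalter.", "Placeholder.", "1", "1"]),
]) + "\n"


def load_entries(handle, min_score=0.0):
    """Return rows whose score is strictly higher than min_score.

    Assumes a header row with the column names documented above.
    QUOTE_NONE keeps quotation marks inside sentences untouched.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        score = float(row["score"])
        if score <= min_score:
            continue
        entries.append({
            "book": row["book"],
            "audio": row["audio"],
            "score": score,
            "de": row["de_sentence"],
            "en": row["en_sentence"],
            "n_de": int(row["#w_de"]),
            "n_en": int(row["#w_en"]),
        })
    return entries


# With a real file:
#   with open("data.tsv", encoding="utf-8", newline="") as f:
#       entries = load_entries(f, min_score=0.5)
entries = load_entries(io.StringIO(SAMPLE_TSV), min_score=0.5)
for entry in entries:
    print(entry["audio"], entry["score"], entry["n_de"], entry["n_en"])
```

Raising min_score trades corpus size for alignment quality, in the same way as the -t option of extract.py.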

Acknowledgments

The research reported in this paper was supported in part by the German Research Foundation (DFG) under grant RI2221/4-1.

Publication

  1. Benjamin Beilharz, Xin Sun, Sariya Karimova and Stefan Riezler
    LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition
    arXiv preprint arXiv:1910.07924, 2019
    @article{beilharz19,
      title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
      author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
      journal = {arXiv preprint arXiv:1910.07924},
      year = {2019},
      url = {https://arxiv.org/pdf/1910.07924.pdf}
    }