LibriVoxDeEn: A corpus for German-to-English Speech Translation and Speech Recognition

This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the audio book setup. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that speech alignment quality is in general very high. The sentence alignment quality is comparable to widely used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for end-to-end speech translation for German.

Terms of Use

LibriVoxDeEn is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite (Beilharz et al., 2020) if you use the corpus in your work.

Data

                  #books  #chapters  #sentences              #hours  #wsource   #unique wsource  #wtarget  #unique wtarget
DE                86      1,556      419,449                 547     4,082,479  264,049          -         -
DE-EN (aligned)   19      365        [DE]53,168, [EN]50,883  133     898,676    93,308           989,768   65,855
DE-EN (filtered)  19      365        [DE]50,427, [EN]50,883  110     860,369    50,072           948,565   40,811

Folder structure

  • de
    • contains German text files for each book
  • en
    • contains English text files for each book
  • audio
    • alignment maps by aeneas
    • segmented audio files
  • tables
    • text2speech.tsv -> lookup table for German text-to-speech alignments
    • text2text.tsv -> lookup table for German-English text-to-text alignments
  • extract.py: tool to extract and sample data from data set

Get the dataset:

Our corpus is available on HeiDATA (Version 1.01 published).

Example usage

Requirements

  • Python 3.5 or later
  • pandas

Parameters & Arguments

  • --all: retrieves complete lookup table
  • -i [int] Book Number: retrieve book with starting index of arg
  • -t [float] Threshold: only show entries with a score higher than arg
  • -b [int] Batch Size: get batch of entries of size arg
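
When extract.py is driven from another Python script, the flags listed above can be assembled into a command line programmatically. The following is only a minimal sketch: the helper name build_extract_command and its keyword arguments are hypothetical and not part of the corpus tools, and it only emits the flags documented here. Whether extract.py accepts combinations other than those shown in the usage examples is not covered by this README.

```python
def build_extract_command(book=None, threshold=None, batch_size=None,
                          retrieve_all=False, script="extract.py"):
    """Build an argument list for extract.py (hypothetical helper).

    Only the flags documented in this README are used:
    --all, -i, -t and -b.
    """
    cmd = ["python3", script]
    if retrieve_all:
        cmd.append("--all")
    if book is not None:
        cmd += ["-i", str(book)]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    return cmd


# Same invocation as "python3 extract.py -i 11 -t 0.5"
book_cmd = build_extract_command(book=11, threshold=0.5)

# Same invocation as "python3 extract.py -b 5000 -t 0.5"
batch_cmd = build_extract_command(batch_size=5000, threshold=0.5)
```

The resulting list can be passed to subprocess.run(cmd, check=True), run from the directory that contains extract.py.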

Retrieving a specific book: python3 extract.py -i 11 [-t 0.5]

Retrieving a batch of entries: python3 extract.py -b 5000 -t 0.5

Retrieve all data: python3 extract.py --all

All of these methods produce a data.tsv file containing the sorted, sampled entries (the number of rows is given by batch_size). For example:

book audio score de_sentence en_sentence #w_de #w_en
18.undine 00001-undine10.wav 0.63 Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, als sehe er durch das Laubgegitter den nickenden Mann hervorkommen. Indeed, when he raised his eyes toward the wood it seemed to him as if he actually saw the nodding man approaching through the dense foliage. 25 26

Column explanation

  • book: the actual book file
  • audio: the audio file for the German sentence
  • score: confidence score given by hunalign
  • de_sentence: source sentence in German
  • en_sentence: corresponding English sentence
  • #w_de: number of words on source side
  • #w_en: number of words on target side
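
The columns above can be read back and filtered by alignment score with a few lines of Python. This is a sketch under assumptions, not the behaviour of extract.py itself: it assumes data.tsv is tab-separated with a header row matching the column names listed here, and it uses an in-memory copy of the example row instead of a real file. The function names read_entries and filter_by_score are illustrative only.

```python
import csv
import io

# In-memory stand-in for data.tsv, using the example row from this README.
# Assumption: the real file is tab-separated and starts with this header row.
SAMPLE_TSV = (
    "book\taudio\tscore\tde_sentence\ten_sentence\t#w_de\t#w_en\n"
    "18.undine\t00001-undine10.wav\t0.63\t"
    "Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich "
    "vor, als sehe er durch das Laubgegitter den nickenden Mann hervorkommen.\t"
    "Indeed, when he raised his eyes toward the wood it seemed to him as if "
    "he actually saw the nodding man approaching through the dense foliage.\t"
    "25\t26\n"
)


def read_entries(handle):
    """Parse TSV rows into dicts and convert the numeric columns."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        row["score"] = float(row["score"])
        row["#w_de"] = int(row["#w_de"])
        row["#w_en"] = int(row["#w_en"])
        entries.append(row)
    return entries


def filter_by_score(entries, threshold):
    """Keep only entries whose alignment score is higher than threshold."""
    return [entry for entry in entries if entry["score"] > threshold]


entries = read_entries(io.StringIO(SAMPLE_TSV))
kept = filter_by_score(entries, 0.5)
```

For a real file, replace the io.StringIO object with open("data.tsv", encoding="utf-8", newline=""). Raising the threshold trades corpus size for alignment quality.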

Fine-tuning data split

This is a train/dev/test split of four selected audio books for fine-tuning using cyclic feedback. Please see (Lam et al., 2021) for details.

Acknowledgments

The research reported in this paper was supported in part by the German Research Foundation (DFG) under grant RI2221/4-1.

Publications

  1. Benjamin Beilharz, Xin Sun, Sariya Karimova and Stefan Riezler
    LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition
    Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France, 2020
    @article{beilharz19,
      title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
      author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
      journal = {Proceedings of the Language Resources and Evaluation Conference},
      journal-abbrev = {LREC},
      year = {2020},
      city = {Marseille, France},
      url = {https://arxiv.org/pdf/1910.07924.pdf}
    }
    
  2. Tsz Kin Lam, Shigehiko Schamoni and Stefan Riezler
    Cascaded Models With Cyclic Feedback For Direct Speech Translation
    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
    @inproceedings{lam2020,
      author = {Lam, Tsz Kin and Schamoni, Shigehiko and Riezler, Stefan},
      year = {2021},
      title = {Cascaded Models With Cyclic Feedback For Direct Speech Translation},
      journal = {IEEE International Conference on Acoustics, Speech and Signal Processing},
      journal-abbrev = {ICASSP},
      url = {http://arxiv.org/abs/2010.11153}
    }