LibriVoxDeEn: A corpus for German-to-English Speech Translation and Speech Recognition

This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the audio book setup. The quality of audio and sentence alignments has been checked by a manual evaluation, showing that the speech alignment quality is in general very high. The sentence alignment quality is comparable to widely used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for end-to-end speech translation for German.

Terms of Use

LibriVoxDeEn is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite (Beilharz et al., 2020) if you use the corpus in your work.

Data

	#books	#chapters	#sentences	#hours	#w_source	#unique w_source	#w_target	#unique w_target
DE	86	1,556	419,449	547	4,082,479	264,049	-	-
DE-EN (aligned)	19	365	[DE]53,168, [EN]50,883	133	898,676	93,308	989,768	65,855
DE-EN (filtered)	19	365	[DE]50,427, [EN]50,883	110	860,369	50,072	948,565	40,811

Folder structure

de
- contains German text files for each book
en
- contains English text files for each book
audio
- alignment maps by aeneas
- segmented audio files
tables
- text2speech.tsv -> lookup table for German text-to-speech alignments
- text2text.tsv -> lookup table for German-English text-to-text alignments
extract.py: tool to extract and sample data from the dataset

Get the dataset

Our corpus is available on HeiDATA (Version 1.01 published).

Example usage

Requirements

Python 3.5 or later
pandas

Parameters & Arguments

--all: retrieve the complete lookup table
-i [int] Book number: retrieve the book whose starting index equals the argument
-t [float] Threshold: only show entries with a score higher than the argument
-b [int] Batch size: retrieve a batch of entries of the given size

Retrieving a specific book: python3 extract.py -i 11 [-t 0.5]

Retrieving a batch of entries: python3 extract.py -b 5000 -t 0.5

Retrieve all data: python3 extract.py --all

Each of these commands produces a data.tsv file containing the sorted, sampled entries (the number of rows is given by batch_size). For example:

book	audio	score	de_sentence	en_sentence	#w_de	#w_en
18.undine	00001-undine10.wav	0.63	Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, als sehe er durch das Laubgegitter den nickenden Mann hervorkommen.	Indeed, when he raised his eyes toward the wood it seemed to him as if he actually saw the nodding man approaching through the dense foliage.	25	26

Column explanation

book: the actual book file
audio: the audio file for the German sentence
score: confidence score given by hunalign
de_sentence: source sentence in German
en_sentence: corresponding English sentence
#w_de: number of words on source side
#w_en: number of words on target side
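
The columns above can also be read programmatically. The following is a minimal sketch, not part of the corpus tooling, that parses a data.tsv-style file with Python's standard csv module and keeps only entries whose score exceeds a cutoff. The function name read_entries and the in-memory demo are illustrative assumptions; the column names and the sample row are taken from the example above. Quoting is disabled on the assumption that sentences may contain literal quotation marks.

```python
import csv
import io

COLUMNS = ["book", "audio", "score", "de_sentence", "en_sentence", "#w_de", "#w_en"]


def read_entries(handle, min_score=0.0):
    """Yield rows of a data.tsv-style file whose score is higher than min_score."""
    # QUOTE_NONE: treat quotation marks inside sentences as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row["score"])
        if score > min_score:
            # Convert the numeric columns; all other columns stay strings.
            row["score"] = score
            row["#w_de"] = int(row["#w_de"])
            row["#w_en"] = int(row["#w_en"])
            yield row


# Demo on the sample row shown above, held in memory instead of on disk.
sample_row = [
    "18.undine",
    "00001-undine10.wav",
    "0.63",
    "Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, "
    "als sehe er durch das Laubgegitter den nickenden Mann hervorkommen.",
    "Indeed, when he raised his eyes toward the wood it seemed to him as if he "
    "actually saw the nodding man approaching through the dense foliage.",
    "25",
    "26",
]
demo = io.StringIO("\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n")
kept = list(read_entries(demo, min_score=0.5))
print(len(kept), kept[0]["audio"], kept[0]["score"])

# For a real file, something like:
# with open("data.tsv", encoding="utf-8", newline="") as handle:
#     rows = list(read_entries(handle, min_score=0.5))
```

Raising min_score trades corpus size for alignment quality, in the same way as the -t option of extract.py. Since pandas is already a requirement, loading the file into a DataFrame with a tab separator is an equally reasonable choice.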

Fine-tuning data split

Train/dev/test split of four selected audio books for fine-tuning using cyclic feedback. Please see (Lam et al., 2021) for details.

librivoxdeen_fine_tuning.tar.gz (518kB, md5: 2bd87fc70385a193f48e0d116ec8d83d)
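
The md5 checksum above can be used to check the downloaded archive. The following is a minimal sketch using Python's hashlib; the function names file_md5 and matches_checksum are illustrative and not part of the corpus tooling, and the archive is assumed to sit in the current directory under the file name listed above.

```python
import hashlib

# Checksum as listed for librivoxdeen_fine_tuning.tar.gz.
EXPECTED_MD5 = "2bd87fc70385a193f48e0d116ec8d83d"


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest equals the expected value."""
    return file_md5(path) == expected.lower()


# Usage, assuming the archive has been downloaded to the current directory:
# print(matches_checksum("librivoxdeen_fine_tuning.tar.gz"))
```

A mismatch usually points to an incomplete or corrupted download, in which case the archive should be fetched again.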

Acknowledgments

The research reported in this paper was supported in part by the German Research Foundation (DFG) under grant RI2221/4-1.

Publications

Benjamin Beilharz, Xin Sun, Sariya Karimova and Stefan Riezler

LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition

Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France, 2020

@article{beilharz19,
  title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
  author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
  journal = {Proceedings of the Language Resources and Evaluation Conference},
  journal-abbrev = {LREC},
  year = {2020},
  city = {Marseille, France},
  url = {https://arxiv.org/pdf/1910.07924.pdf}
}

Tsz Kin Lam, Shigehiko Schamoni and Stefan Riezler

Cascaded Models With Cyclic Feedback For Direct Speech Translation

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

@inproceedings{lam2020,
  author = {Lam, Tsz Kin and Schamoni, Shigehiko and Riezler, Stefan},
  year = {2021},
  title = {Cascaded Models With Cyclic Feedback For Direct Speech Translation},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  journal-abbrev = {ICASSP},
  url = {http://arxiv.org/abs/2010.11153}
}