LibriVoxDeEn: A corpus for German-to-English Speech Translation and Speech Recognition
This dataset is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audio books. The corpus consists of over 100 hours of audio material and over 50k parallel sentences. The speech data are low in disfluencies because of the audio book setup. The quality of the audio and sentence alignments has been checked by a manual evaluation, which showed that the speech alignment quality is in general very high. The sentence alignment quality is comparable to widely used parallel translation data and can be adjusted by cutoffs on the automatic alignment score. To our knowledge, this corpus is to date the largest resource for end-to-end speech translation for German.
Terms of Use
LibriVoxDeEn is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please cite Beilharz et al. (2020) if you use the corpus in your work.
Data
| | #books | #chapters | #sentences | #hours | #words (source) | #unique words (source) | #words (target) | #unique words (target) |
|---|---|---|---|---|---|---|---|---|
| DE | 86 | 1,556 | 419,449 | 547 | 4,082,479 | 264,049 | - | - |
| DE-EN (aligned) | 19 | 365 | [DE]53,168, [EN]50,883 | 133 | 898,676 | 93,308 | 989,768 | 65,855 |
| DE-EN (filtered) | 19 | 365 | [DE]50,427, [EN]50,883 | 110 | 860,369 | 50,072 | 948,565 | 40,811 |
Folder structure
- de
    - contains german text files for each book
 
- en
    - contains english text files for each book
 
- audio
    - alignment maps by aeneas
    - segmented audio files
 
- tables
    - text2speech.tsv -> lookup table for German text-to-speech alignments
    - text2text.tsv -> lookup table for German-English text-to-text alignments
 
- extract.py: tool to extract and sample data from data set
Get the dataset:
Our corpus is available on HeiDATA (Version 1.01 published).
Example usage
Requirements
- Python >= 3.5
- pandas
Parameters & Arguments
- --all: retrieves the complete lookup table
- -i [int] Book number: retrieves the book whose starting index equals the argument
- -t [float] Threshold: only shows entries with a score higher than the argument
- -b [int] Batch size: gets a batch of entries whose size equals the argument
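The flags listed above could be parsed along the following lines. This is a hypothetical sketch using Python's standard argparse module; it is not the actual extract.py, whose implementation and defaults may differ. The function name, metavar names, and help strings are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the real extract.py)."""
    parser = argparse.ArgumentParser(prog="extract.py")
    parser.add_argument("--all", action="store_true",
                        help="retrieve the complete lookup table")
    parser.add_argument("-i", type=int, metavar="BOOK",
                        help="retrieve the book with this starting index")
    parser.add_argument("-t", type=float, metavar="THRESHOLD",
                        help="only keep entries with a score above this value")
    parser.add_argument("-b", type=int, metavar="BATCH",
                        help="retrieve a batch of this many entries")
    return parser


# Parse the same arguments as in "python3 extract.py -b 5000 -t 0.5".
args = build_parser().parse_args(["-b", "5000", "-t", "0.5"])
print(args.b, args.t, args.all)
```

No defaults are set in this sketch because the README does not state any; unset options simply come back as None.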
Retrieving a specific book:
python3 extract.py -i 11 [-t 0.5]
Retrieving a batch of entries:
python3 extract.py -b 5000 -t 0.5
Retrieve all data:
python3 extract.py --all
Each of these commands produces a data.tsv file containing the sorted, sampled entries; when a batch is requested, the number of rows is determined by the batch size. For example:
| book | audio | score | de_sentence | en_sentence | #w_de | #w_en | 
|---|---|---|---|---|---|---|
| 18.undine | 00001-undine10.wav | 0.63 | Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, als sehe er durch das Laubgegitter den nickenden Mann hervorkommen. | Indeed, when he raised his eyes toward the wood it seemed to him as if he actually saw the nodding man approaching through the dense foliage. | 25 | 26 | 
Column explanation
- book: the actual book file
- audio: the audio file for the German sentence
- score: confidence score given by hunalign
- de_sentence: source sentence in German
- en_sentence: corresponding English sentence
- #w_de: number of words on source side
- #w_en: number of words on target side
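Once data.tsv exists, the score column can be used to apply a stricter alignment cutoff after extraction. The following is a minimal sketch using only the standard library, assuming that data.tsv is tab-separated and starts with a header row carrying the column names listed above (this layout is an assumption). The function name filter_by_score is hypothetical, and the sample row is the example entry shown earlier.

```python
import csv
import io

# Assumed layout: tab-separated, with a header row using the documented column names.
HEADER = ["book", "audio", "score", "de_sentence", "en_sentence", "#w_de", "#w_en"]
ROW = [
    "18.undine",
    "00001-undine10.wav",
    "0.63",
    "Ja, als er die Augen nach dem Walde aufhob, kam es ihm ganz eigentlich vor, "
    "als sehe er durch das Laubgegitter den nickenden Mann hervorkommen.",
    "Indeed, when he raised his eyes toward the wood it seemed to him as if he "
    "actually saw the nodding man approaching through the dense foliage.",
    "25",
    "26",
]
SAMPLE_TSV = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def filter_by_score(tsv_file, threshold):
    """Return the rows whose alignment score is strictly above the threshold."""
    # QUOTE_NONE keeps quotation marks inside sentences from being treated as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if float(row["score"]) > threshold]


# With a real file, pass open("data.tsv", encoding="utf-8", newline="") instead.
rows = filter_by_score(io.StringIO(SAMPLE_TSV), threshold=0.5)
for row in rows:
    print(row["book"], row["audio"], row["score"])
```

The same filtering can be done with pandas, which extract.py already requires; the csv module is used here only to keep the sketch free of dependencies.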
Fine-tuning data split
Train/dev/test split of four selected audio books for fine-tuning using cyclic feedback. Please see (Lam et al., 2021) for details.
- librivoxdeen_fine_tuning.tar.gz (518kB, md5: 2bd87fc70385a193f48e0d116ec8d83d)
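After downloading the archive, its integrity can be checked against the MD5 sum listed above. The sketch below uses Python's hashlib; the helper names md5_of_file and verify_archive are hypothetical, and the file is assumed to sit in the current working directory under its listed name.

```python
import hashlib

# MD5 sum as listed for librivoxdeen_fine_tuning.tar.gz.
EXPECTED_MD5 = "2bd87fc70385a193f48e0d116ec8d83d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

A call such as verify_archive("librivoxdeen_fine_tuning.tar.gz") would then return True for an intact download.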
Acknowledgments
The research reported in this paper was supported in part by the German Research Foundation (DFG) under grant RI2221/4-1.
Publications
- LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition. Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France, 2020.
    @article{beilharz19,
      title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
      author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
      journal = {Proceedings of the Language Resources and Evaluation Conference},
      journal-abbrev = {LREC},
      year = {2020},
      city = {Marseille, France},
      url = {https://arxiv.org/pdf/1910.07924.pdf}
    }
- Cascaded Models With Cyclic Feedback For Direct Speech Translation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
    @inproceedings{lam2020,
      author = {Lam, Tsz Kin and Schamoni, Shigehiko and Riezler, Stefan},
      year = {2021},
      title = {Cascaded Models With Cyclic Feedback For Direct Speech Translation},
      journal = {IEEE International Conference on Acoustics, Speech and Signal Processing},
      journal-abbrev = {ICASSP},
      url = {http://arxiv.org/abs/2010.11153}
    }