JoeySpeech2Text: Minimalistic Speech-to-Text Modeling with JoeyNMT

JoeySpeech2Text is a JoeyNMT extension for speech-to-text tasks such as 
automatic speech recognition and end-to-end speech translation. It inherits the
core philosophy of JoeyNMT, a minimalist NMT toolkit built on PyTorch, seeking
simplicity and accessibility. JoeySpeech2Text's workflow is self-contained,
spanning data pre-processing, model training, prediction, and evaluation,
and is seamlessly integrated into JoeyNMT's compact and simple code
base. On top of JoeyNMT's state-of-the-art Transformer-based Encoder-Decoder
architecture, JoeySpeech2Text provides speech-oriented components such as
convolutional layers, SpecAugment, CTC loss, and WER evaluation. Despite its
simplicity compared to prior implementations, JoeySpeech2Text performs
competitively on English speech recognition and English-to-German translation
benchmarks. The implementation is accompanied by a walk-through tutorial and
is available on GitHub.
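
For readers unfamiliar with the WER (word error rate) metric named above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. This is an illustration only and not JoeySpeech2Text's own evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. Toolkits typically also normalize case and punctuation before scoring, which this sketch omits.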