End-to-End Speech Translation: Opportunities and Challenges

Over its three-decade history, speech translation has experienced several shifts in its primary research themes: moving from loosely coupled cascades of speech recognition and machine translation, to exploring questions of tight coupling, and finally to end-to-end models that have recently attracted much attention. This talk will begin with a discussion of the main challenges of traditional approaches, which stem from committing to intermediate representations produced by the speech recognizer and from training cascaded models separately toward different objectives. We will then focus on recent end-to-end modeling techniques, which promise a principled way of overcoming these issues by allowing joint training of all model components and removing the need for explicit intermediate representations. Such models, however, come with new challenges, of which we will highlight two.

The first challenge stems from the fact that speech translation data is expensive and scarce. Strategies are needed to incorporate general speech recognition and machine translation training data into end-to-end model training in a way that achieves data efficiency competitive with traditional cascades, while not compromising the inherent advantages of end-to-end models.

The second challenge stems from the fact that many real-life speech translation applications, such as typical two-way conversational translators, require displaying both the translation and the transcript to users at the same time. This goal is in tension with recent end-to-end modeling efforts, which often aim at removing the modeling of transcripts. We will compare several end-to-end approaches for jointly generating transcripts and translations, and show that coupled inference procedures are needed to avoid undesirable inconsistencies between the two outputs.
We will therefore introduce metrics that quantitatively capture the consistency between transcripts and translations, and demonstrate that joint end-to-end models not only outperform traditional cascades in accuracy, but also produce more consistent transcripts and translations.