Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Optimizing Data Usage in Neural Sequence-To-Sequence Learning

Module Description

Course Module Abbreviation Credit Points
BA-2010[100%|75%] CS-CL 6 LP
BA-2010[50%] BS-CL 6 LP
BA-2010[25%] BS-AC 4 LP
BA-2010 AS-CL 8 LP
Master SS-CL, SS-TAC 8 LP
Lecturer Tsz Kin Lam
Module Type Hauptseminar
Language English
First Session 28.10.2021
Time and Place Thursday, 11:15-12:45, Online
Commitment Period tba

Prerequisite for Participation

  • Good knowledge of statistical machine learning (e.g., through successful completion of the courses "Statistical Methods for Computational Linguistics" and/or "Neural Networks: Architectures and Applications for NLP"), experience in experimental work (e.g., a software project or a seminar implementation project), and basic knowledge of sequence-to-sequence learning.

Assessment

  • Regular and active participation: reading research papers and asking questions in class
  • Oral presentation of one or more selected papers
  • Implementation project


    Deep learning is the de facto standard for many classification tasks, e.g., in natural language processing or image recognition. However, it is also notorious for being data-hungry. This data hunger, together with the costly annotation process, has stimulated much research on creating synthetic data, also known as data augmentation. Another popular method is to obtain additional data by crawling the web, also known as data crawling. Both approaches allow substantial increases in training data at little cost. However, synthetic or crawled data can be noisy, e.g., due to misalignments between source and target sentences, or due to a domain mismatch between the new data and the original training data. This casts doubt on the benefit of such additional data for the final model performance, and it is where data selection comes into play.
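    To make the idea of data selection concrete, here is a minimal sketch of a length-ratio filter, a common heuristic for cleaning crawled parallel corpora: pairs whose source and target token counts diverge too strongly are likely misaligned. The function name, thresholds, and example sentences are illustrative assumptions, not part of the seminar materials.

```python
def length_ratio_filter(pairs, max_ratio=1.5, min_len=1, max_len=100):
    """Keep (source, target) pairs whose token counts are plausible.

    Misaligned or truncated crawled pairs often show extreme length
    ratios; this heuristic discards them before training.
    """
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        # Drop pairs that are too short or too long on either side.
        if not (min_len <= len(s) <= max_len and min_len <= len(t) <= max_len):
            continue
        # Drop pairs whose length ratio suggests misalignment.
        if max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio:
            kept.append((src, tgt))
    return kept

pairs = [
    ("das ist ein Test", "this is a test"),
    ("kurz", "this target is far too long for its source sentence to be aligned"),
]
filtered = length_ratio_filter(pairs)  # only the first pair survives
```

    Real filtering pipelines combine several such heuristics (language identification, deduplication, model-based scores); the length ratio is only the simplest building block.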

    The focus of this seminar is on optimizing data usage in neural sequence-to-sequence learning on text data. Participants will learn about recent advances in data selection and data augmentation, and about their connections to multi-domain scenarios. The application focus will be sequence-to-sequence learning, especially machine translation.

    Topics will include (but are not limited to):

    • Data selection/filtering
    • Data augmentation and adversarial inputs
    • Generalization over multiple domains
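
    As a small taste of the data-augmentation topic, the sketch below applies word dropout, randomly deleting tokens to create noisy synthetic variants of a training sentence. The function and its defaults are illustrative assumptions, not code from the seminar.

```python
import random

def word_dropout(sentence, p=0.1, rng=None):
    """Randomly drop tokens with probability p to create a noisy variant.

    A simple augmentation heuristic; always keeps at least one token
    so the synthetic sentence is never empty.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() >= p]
    return " ".join(kept) if kept else tokens[0]

noisy = word_dropout("the quick brown fox jumps", p=0.2)
```

    In practice, such perturbations are combined with stronger methods like back-translation, but the principle is the same: trade a little noise for a lot of extra training signal.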

    Module Overview


    Date Session Materials


