Ruprecht-Karls-Universität Heidelberg




    Research Grant: "Auto-Adaptive Learning from Weak Feedback
    for Interactive Lecture Translation" (PI)

    Summary: High-quality machine translation requires interaction with human translators, either in the form of post-edits or of interactive translation prediction. The high cost and required expertise of professional translators call for a scenario in which machine translation systems can be learned from weaker feedback that can be elicited from lay users. Similar to computational advertising, where systems are adapted from user clicks, we attempt to learn machine translation from bandit feedback in the form of judgments on the quality of a predicted translation, without requiring a post-edit or a gold-standard translation.
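
    The bandit feedback setting described in this summary can be illustrated with a minimal sketch. It is not the project's actual system: it assumes a fixed list of candidate translations represented by feature vectors, a log-linear (softmax) model over them, and a hypothetical `feedback` function that stands in for a user's quality judgment. The learner samples one translation, observes only the judgment for that translation, and applies a score-function gradient update.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def bandit_update(weights, candidates, feedback, lr=0.5, rng=random):
    """One learning step from bandit feedback.

    weights:    current model weights (list of floats)
    candidates: one feature vector per candidate translation (assumed given)
    feedback:   hypothetical callable, index -> reward in [0, 1]
    """
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in candidates]
    probs = softmax(scores)

    # Sample ONE translation to show; only its reward is ever observed.
    k = rng.choices(range(len(candidates)), weights=probs)[0]
    reward = feedback(k)

    # Expected feature vector under the current model.
    expected = [
        sum(p * feats[i] for p, feats in zip(probs, candidates))
        for i in range(len(weights))
    ]

    # Score-function gradient of expected reward: reward * (phi(k) - E[phi]).
    new_weights = [
        w + lr * reward * (candidates[k][i] - expected[i])
        for i, w in enumerate(weights)
    ]
    return new_weights, k, reward
```

    No reference translation or post-edit enters the update; the only supervision is the scalar reward for the sampled output.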



    Industry Research Cooperation (PI)

    Summary: Statistical machine translation is an enabling tool for e-commerce localization. It allows sellers to offer their products on international markets through automatic translation of their product descriptions. We started a cooperative research project with Amazon in which we attempt to leverage weak feedback from users to improve machine translation for e-commerce applications.



    Research Network "SCIDATOS" (Co-PI)

    Summary: SCIDATOS is a research network on Scientific Computing for Improved Detection and Therapy of Sepsis, together with the Center for Scientific Computing (IWR) Heidelberg and the University Medical Center Mannheim (UMM). The goal of the project is the reliable diagnosis of sepsis and its timely therapy in critically ill patients. Our approach is a combination of machine learning and simulation, based on clinical data as well as on free-text formats of electronic health records.



    Research Grant: "Grounding SMT in Perception and Action" (PI)

    Summary: Grounded statistical machine translation (SMT) introduces the concept of a task-specific evaluation of translation quality and offers the opportunity to deploy task-specific feedback on translations as data for learning SMT systems. The main challenge of our project is the investigation of new mechanisms to train and evaluate SMT systems by grounding them in interactions with the world. We will focus on response-based learning, in which the only supervision signal available to the learner is the response from acting in the world. One example is the translation of executable database queries, where a supervision signal can be extracted from executing the translated query against the database. Another example is feedback from human translators in grounded scenarios.
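
    The database example in this summary can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's pipeline: the candidate output is assumed to already be an SQL string, the toy `states` table and the helper names `build_toy_db` and `execution_feedback` are hypothetical, and the response is reduced to a binary signal (1.0 if executing the query returns the expected answer, 0.0 otherwise).

```python
import sqlite3


def build_toy_db():
    """Create a small in-memory database (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE states (name TEXT, capital TEXT)")
    conn.executemany(
        "INSERT INTO states VALUES (?, ?)",
        [("Texas", "Austin"), ("Ohio", "Columbus")],
    )
    return conn


def execution_feedback(conn, candidate_sql, expected_rows):
    """Return 1.0 if the candidate query executes and yields the expected
    rows (order ignored), else 0.0. A query that fails to execute also
    counts as negative feedback."""
    try:
        rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return 0.0
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0
```

    A learner could use this signal in place of a reference translation, for instance to separate candidate outputs into positive and negative examples.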



    Research Grant: "Weakly Supervised Learning of Cross-Lingual Systems" (PI)

    Summary: Cross-lingual rankings for information retrieval can be learned directly from data that are weakly supervised by relevance indicators such as citations in patents or hyperlinks in Wikipedia pages, but are not strictly parallel. We intend to turn this idea on its head by applying the techniques that have been successful in learning-to-rank for cross-lingual retrieval to the discriminative training of machine translation on massive non-parallel data and, in the process, to further improve methods for cross-lingual retrieval. The key ingredient of our proposed techniques will be the combination of learning from weakly supervised data with methods that make the best use of the weak supervision signals, namely fine-grained sparse features and learning from both positive and negative examples. We motivate our research by an application to translation and cross-lingual retrieval in the medical domain, where massive amounts of quasi-parallel training data are available on the Internet, in research publications, and in patent data.



    Research Grant: "Cross-lingual Learning-to-Rank for Patent Retrieval" (PI)

    Summary: Prior art search is an important tool to determine a patent's novelty and to avoid patent infringement. The task involves two problems, patent translation and patent retrieval, that need to be solved in multiple languages. Because of highly specialized jargon and a multitude of patent domains, both tasks are considered difficult on their own. While most previous approaches have addressed translation and search as separate problems, we propose a synergetic combination of patent translation and patent search in a well-defined machine learning framework. Patent search is defined as a monolingual learning-to-rank problem that optimizes the ranking of prior art patents for patent queries. Patent translation is defined as a multi-task learning problem that optimizes translation quality across multiple patent domains. The translation system utilizes patent search by directly incorporating a translation's contribution to search quality into the optimization of translation parameters. The goal of the project is to show the mutual benefit of this integration of translation and search.



    Graduate program "Coherence in Language Processing: Semantics beyond the Sentence", funded by Landesgraduiertenförderung, Ministerium für Wissenschaft, Forschung, und Kunst,
    Baden-Württemberg (Co-PI)

    Summary: Humans effortlessly perform the task of combining information from individual sentences in text (discourse) into an understanding of complete situations (that is, to create coherence), even though semantic relations between sentences often remain unexpressed and must be inferred from background and world knowledge. Current systems for natural language processing (NLP) deal fairly well with semantic analysis for individual sentences, but are generally unable to combine information from multiple sentences or to retrieve unexpressed information. This limits their utility in tasks where whole texts must be processed, such as machine translation or question answering. The goal of this graduate program is to extend semantic analysis to the discourse level and to approximate coherence-based interpretation.



    Google Faculty Research Award:
    "Learning Answer Quality Rankings for Non-Factoid Question Answering" (PI)

    Summary: Community QA portals provide an important resource for non-factoid question-answering. The inherent noisiness of user-generated data makes the identification of high-quality content challenging but all the more important. The project presents an approach to answer ranking and shows the usefulness of features that explicitly model answer quality. Furthermore, we introduce the idea of leveraging snippets of web search results for query expansion in answer ranking. We present an evaluation setup that avoids spurious results reported in earlier work. Our results show the usefulness of our features and query expansion techniques, and point to the importance of regularization when learning from noisy data.