StatNLP Heidelberg

RL in NMT: The Good, the Bad and the Ugly

2020-01-16T00:00:00+00:00

Let me introduce you to three popular practices for using reinforcement learning (RL) in neural machine translation (NMT): the Good, combining it with good old maximum likelihood estimation (MLE), the Ugly, combining it with “hacks”, and the Bad, applying it with ignorance of more evolved techniques. Those three are helping NMT researchers on the hunt for BLEU scores.

Western movies aside, the aim of this blogpost is to take a critical look at the recent trend to include RL-inspired objectives in NMT training. We’ll start with a recap of RL training in NMT, dive right into an empirical study by Wu et al. 2018, leading to the discussion of the three following questions:

How do NMT and RL fit together?
Why do we even get any benefits from an RL objective in supervised learning?
Where can we find the real challenges?

tl;dr RL is a popular first-aid method to fix supervised NMT training, but maybe not the most suitable one. RL shines outside supervised learning; new challenges and opportunities are to be found there.

The Basics

Introducing RL to incorporate rewards.

Maximum Likelihood Estimation

Standard auto-regressive NMT models, parametrized by a neural network with parameters $\theta$ , are trained with maximum likelihood estimation on parallel data $(x, y) \in \mathcal{D}$ resulting in the popular cross-entropy objective:

$\begin{align} \text{MLE} &= \sum_{(x,y) \in \mathcal{D}} \log p_{\theta}(y \mid x) \end{align}$
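To make the objective concrete, here is a minimal sketch in plain Python. It assumes we already have the probabilities the model assigns to each reference token; the function names and the toy numbers are made up for illustration and do not come from any particular toolkit.

```python
import math

def sequence_log_likelihood(token_probs):
    """log p(y | x) as the sum of token-level log-probabilities."""
    return sum(math.log(p) for p in token_probs)

def mle_objective(corpus_token_probs):
    """Sum of sequence log-likelihoods over all (x, y) pairs in the data."""
    return sum(sequence_log_likelihood(probs) for probs in corpus_token_probs)

# Hypothetical model probabilities for the reference tokens of two sentence pairs.
corpus = [[0.5, 0.25], [0.8]]
objective = mle_objective(corpus)
```

Maximizing this quantity is the same as minimizing the token-level cross-entropy between the reference and the model distribution.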

Expected Reward

So how does RL come into play? The idea is to introduce rewards to encourage model outputs that would obtain a high reward, not only the perfect reference translation (=MLE). In practice, rewards can be simulated with e.g., sentence-level BLEU scores, to reinforce samples that – if evaluated – would obtain a high BLEU score. You might ask yourself, why is it even necessary? We’ll discuss that in a bit. Assuming the existence of such scalar rewards obtained from $R: \mathcal{Y} \to [0,1]$ we can formulate an objective that aims to maximize the expected reward for all model outputs:

$\begin{align} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \right] \end{align}$

Policy Gradient

In contrast to the MLE objective, the RL objective is not differentiable with respect to $\theta$ , because the reward is a discrete function of the outputs of the model. Luckily, with the help of the log-derivative trick, we can reformulate the gradient for this objective, also referred to as the policy gradient:

$\begin{align} \nabla_{\theta} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \nabla_{\theta} \log p_{\theta}(y \mid x)\right] \end{align}$
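For readers who have not seen the log-derivative trick before, the intermediate steps can be spelled out as follows. This is a sketch that assumes a discrete output space $\mathcal{Y}$ and uses the identity $\nabla_{\theta} p_{\theta} = p_{\theta} \nabla_{\theta} \log p_{\theta}$:

$\begin{align} \nabla_{\theta} \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \right] &= \sum_{y \in \mathcal{Y}} R(y) \nabla_{\theta} p_{\theta}(y \mid x) \\ &= \sum_{y \in \mathcal{Y}} R(y) \, p_{\theta}(y \mid x) \nabla_{\theta} \log p_{\theta}(y \mid x) \\ &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \nabla_{\theta} \log p_{\theta}(y \mid x) \right] \end{align}$

The reward itself is never differentiated; only the log-probability of the output is.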

We can now empirically estimate the gradient with e.g. Monte Carlo sampling and train our model with stochastic gradient ascent. This solution was introduced in the famous REINFORCE algorithm by Williams 1992. REINFORCE proposes to estimate the gradient with one sample for each input:

$\begin{align} \tilde{\nabla}_{\theta} \text{RL} &= R(\tilde{y}) \nabla_{\theta} \log p_{\theta}(\tilde{y} \mid x),& \tilde{y} \sim p_{\theta}(y \mid x) \end{align}$
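The single-sample estimate can be sketched on a toy problem. The snippet below assumes a policy that is just a softmax over three candidate outputs, with the logits playing the role of $\theta$, and a made-up reward per candidate. A real NMT model would factorize over tokens, so treat this as an illustration of the estimator only.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, reward_fn, rng):
    """Single-sample REINFORCE estimate of the expected-reward gradient."""
    probs = softmax(logits)
    # Sample one output from the policy; we never get to try the others.
    y = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(y)
    # For a softmax policy: d log p(y) / d logit_k = 1[k == y] - p_k.
    grad = [reward * ((1.0 if k == y else 0.0) - probs[k])
            for k in range(len(logits))]
    return y, grad

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
candidate_rewards = [0.1, 0.9, 0.3]  # hypothetical rewards in [0, 1]
sampled, grad = reinforce_gradient(logits, lambda y: candidate_rewards[y], rng)
# One step of stochastic gradient ascent with an arbitrary learning rate.
logits = [l + 0.5 * g for l, g in zip(logits, grad)]
```

With non-negative rewards, every sampled output gets pushed up in proportion to its reward, which is one reason the variance-reduction tricks discussed further down matter.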

How does this bring us to RL? In RL, more precisely in policy search, $p_{\theta}$ is a policy that predicts actions $y$ . The policy chooses one action and then receives a reward for this action from the environment. Importantly, it is not possible to go back and try other actions instead and compare their rewards. In a genuine RL setup, we are limited to single-sample estimates.

Training

The current practice in NMT is to approximate the policy gradient with either multinomial sampling from the softmax-normalized outputs of the NMT model, or by beam search. The two objectives are trained either sequentially (e.g., supervised pre-training before reinforced fine-tuning, or alternating batches) or simultaneously (e.g., by linear interpolation).

Discussing Recent Trends of RL in NMT

If we care about BLEU, RL alone won’t help.

In the recent EMNLP paper “A Study of RL for NMT”, Wu et al. 2018 observe that RL-inspired training objectives have been shown to improve NMT quality, but usually don’t come without tricks and rather weak baselines. Their question is now: Combining various variants of these tricks with learning from monolingual data, does RL still shine as expected?

To spoil the suspense right away, the study finds that using RL leads to marginal improvements over well-tuned baselines, also in combination with MLE and monolingual data (the good). However, the largest portions of improvement come from leveraging additional monolingual data (old news) (the ugly). But the RL-inspired approaches evaluated here lack comparisons to more evolved techniques, and assume access to reference translations (the bad). Let’s take a closer look!

RL Tricks

Variance Reduction

The variance of the gradient estimator can be a problem for optimization, i.e. slow down or hinder convergence. The paper investigates the following solutions:

Average reward baseline: Instead of using the reward directly, subtract its empirical average from the reward obtained.
Learned baseline: Subtract a learned reward instead of the empirical average. The learned reward is the output of a regression model, e.g. another neural network.
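As an illustration of the first variant, here is a minimal sketch of an average reward baseline. It assumes the baseline is the running mean of all rewards seen before the current one; the class name is hypothetical, and implementations differ in whether they average over a batch, a window, or the whole history.

```python
class AverageRewardBaseline:
    """Running average of past rewards, subtracted from each new reward."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def advantage(self, reward):
        # Baseline is computed from earlier rewards only (0.0 before any are seen).
        baseline = self.total / self.count if self.count else 0.0
        self.total += reward
        self.count += 1
        return reward - baseline

baseline = AverageRewardBaseline()
advantages = [baseline.advantage(r) for r in [0.5, 0.7, 0.3]]
```

The centered value then replaces $R(\tilde{y})$ in the gradient estimate, so outputs that are worse than average are pushed down instead of merely pushed up less.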

The baseline was actually already proposed in the original REINFORCE paper and can be interpreted as an additive control variate (Ross 2013). Actor-critic (AC) approaches go a step further and replace the reward obtained from the environment by a reward given by a critic that is trained to imitate the original reward (applied to NMT by e.g. Bahdanau et al. 2017, Nguyen et al. 2017).

Despite the reported effectiveness in practice, Greensmith et al. 2004 showed that both above solutions are suboptimal and that one can actually learn an optimal baseline with minimal variance.

One important aspect that has been completely neglected in the present study is that the number of samples used for the Monte Carlo gradient estimate has an essential influence on the variance of the gradient. If rewards are simulated anyway, e.g., from references using sentence-level BLEU, why not sample multiple times and average the gradients over this subset? This may sound familiar, since this is exactly what is done in minimum risk training (proposed for NMT by Shen et al. 2016).
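A rough sketch of the multi-sample idea in the spirit of minimum risk training: the model probabilities are renormalized over a small set of sampled translations and the rewards are averaged under that distribution. The function name, the toy numbers and the default sharpness `alpha` are assumptions for illustration; Shen et al. 2016 formulate the objective in terms of a loss (risk) rather than a reward.

```python
import math

def mrt_expected_reward(log_probs, rewards, alpha=1.0):
    """Expected reward over a sampled subset, with p(y|x)^alpha renormalized."""
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return sum(w / z * r for w, r in zip(weights, rewards))

# Hypothetical model log-probabilities and simulated rewards for three samples.
sample_log_probs = [math.log(0.2), math.log(0.1), math.log(0.1)]
sample_rewards = [0.8, 0.3, 0.9]
expected = mrt_expected_reward(sample_log_probs, sample_rewards)
uniform = mrt_expected_reward(sample_log_probs, sample_rewards, alpha=0.0)
```

Setting `alpha` to zero ignores the model scores and reduces to a plain average over the samples.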

In Wu et al. 2018’s empirical study, there was no beneficial effect observed when using the learned baseline. This contradicts the experience in e.g., Bahdanau et al. 2017 and Kreutzer et al. 2017. The conclusion that reward baselines are not necessary from “the economic perspective” (Wu et al. 2018) might be a bit overhasty.

Reward Shaping

If the reward is only obtained at the end of each sequence (here: translation), how does the model know where the errors are? The problem of credit assignment is often addressed by introducing methods for reward shaping (Ng et al. 1999). Wu et al. 2018 investigate the implementation by Bahdanau et al. 2017: For each element of the output, the individual reward is the difference between the BLEU score for the partial output including and the BLEU score for the partial output excluding the element: $R(y_t) = R(y_{1:t}) - R(y_{1:t-1})$ . Note that the BLEU scores are computed with respect to the full reference output. Once again the references are exploited to simulate the rewards.
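The difference-based shaping can be sketched in a few lines. To keep the example self-contained, it uses a simple clipped unigram overlap with the reference as a stand-in reward; this is explicitly not BLEU, and the function names are made up for illustration.

```python
from collections import Counter

def unigram_overlap(prefix, reference):
    """Stand-in reward (NOT BLEU): clipped unigram matches / reference length."""
    if not reference:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(prefix)
    matches = sum(min(c, ref_counts[tok]) for tok, c in hyp_counts.items())
    return matches / len(reference)

def shaped_rewards(hypothesis, reference, reward_fn=unigram_overlap):
    """Token-level rewards R(y_t) = R(y_{1:t}) - R(y_{1:t-1})."""
    rewards = []
    previous = reward_fn([], reference)
    for t in range(1, len(hypothesis) + 1):
        current = reward_fn(hypothesis[:t], reference)
        rewards.append(current - previous)
        previous = current
    return rewards

reference = "the cat sat on the mat".split()
hypothesis = "the dog sat on the mat".split()
token_rewards = shaped_rewards(hypothesis, reference)
```

The token-level rewards telescope: their sum equals the reward of the complete translation, and the wrong token ("dog") is the one that receives zero reward.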

But does this even address the original problem of credit assignment? The problem arose because we had to wait for rewards from the environment until we completed a sequence of actions (in NMT: produced a complete translation). As soon as references are used, we are in principle not restricted to delayed rewards anymore. One could for example compare each word in the translation to the words in the reference translation and then come up with a token-based reward. Simple binary rewards were for example proposed in Petrushkov et al. 2018.

As long as we simulate the rewards using references, we can cheat our way around the real problem. When references are not available and you simply cannot compute BLEU scores for any arbitrary, partial translation – what would you do?

To this end Nguyen et al. 2017 adopt the advantage-actor critic (A2C) framework (Mnih et al. 2016). A critic network predicts the expected future reward for each element, although the reward from the environment (here: BLEU) is only obtained at the end of the sequence. Unfortunately, the latter study does not include a comparison to RL approaches without reward shaping.

The empirical gains from reward shaping reported in Wu et al. 2018’s study are vanishingly small, which leaves the question of the usefulness of this method unanswered.

Using Monolingual Data

Target-side

Leveraging monolingual data for improving MT systems has become increasingly popular, since simple methods have been shown to be very effective for NMT. When target-side monolingual data is available, the trick-of-the-trade is to use back-translation as demonstrated in Sennrich et al. 2016. The only burden here is that one has to train a system in the opposite translation direction. This system can then generate pseudo-sources for the available target data. The “hallucinated” parallel data can then be used for standard training, with simulated rewards or without.
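The data flow of back-translation can be sketched as follows. The backward model here is a toy stand-in (it just reverses the word order) for a trained target-to-source system, and all names and sentences are hypothetical.

```python
def back_translate(target_sentences, backward_model):
    """Pair each genuine target sentence with a generated pseudo-source."""
    return [(backward_model(t), t) for t in target_sentences]

def toy_backward_model(target_sentence):
    # Hypothetical stand-in for a trained target-to-source NMT system.
    return " ".join(reversed(target_sentence.split()))

parallel = [("guten morgen", "good morning")]
monolingual_targets = ["good night", "thank you"]

synthetic = back_translate(monolingual_targets, toy_backward_model)
training_data = parallel + synthetic
```

Note that only the source side is synthetic; the targets the forward model learns to produce stay intact.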

But isn’t it problematic to feed the NMT with fake data? Apparently not, at least as long as the targets are intact. Edunov et al. 2018a investigate this question systematically and surprisingly find that models get even better when the pseudo-sources are of low quality (not for small data, though). They hypothesize that the noise introduced actually enriches the training data and helps learning as e.g., in denoising auto-encoders.

Source-side

Wu et al. 2018 propose to leverage not only target-side monolingual data, but also source-side monolingual data. Evoking techniques developed in the context of self-training, the idea is to let the model generate pseudo-targets for its own training. We have to assume that it is able to generate targets that are “good enough”, in the sense that the model can bootstrap itself. In practice, this is addressed by using beam search decoding for generating translations that are likely to have higher quality than sampled or greedy decoded targets.

Does the quality of the pseudo-targets matter? When they are part of the RL objective, they are only used to simulate rewards for sampled translations, which perhaps can absorb some of the noise. In supervised MLE training Wu et al. 2018 add them to the much larger original parallel data – the small amount of extra noise might be negligible. However, this has not been investigated systematically.

NMT as an RL problem

We only (mis-)use a subset of RL methods in NMT.

The “Study of RL in NMT” is limited to a very specific scenario where policy gradient is used for fine-tuning of well-trained models. What about other RL algorithms? RL researchers have in fact dealt with reinforced objectives as above for decades and have developed more sophisticated training algorithms (such as Trust Region Policy Optimization and Proximal Policy Optimization) than vanilla policy gradient. But that’s to be discussed in another blog post. Nevertheless, so far only policy gradient and actor-critic have become really popular for structured prediction tasks. So what’s wrong, are we just slow in adopting their algorithms?

In fact, it is not trivial to cast NMT, or more generally structured prediction, as a standard (PO)MDP problem which is the basis for most RL algorithms: What is the environment? What is the state? Where does the reward come from? Translation researchers don’t agree on it (comparing e.g. definitions in Wu et al. 2018, Nguyen et al. 2017, Bahdanau et al. 2017). It is in fact often more suitable to cast it as a simpler contextual bandit problem, aka bandit structured prediction (e.g., Sokolov et al. 2016, Kreutzer et al. 2017, Daumé III et al. 2018), as Hal Daumé III discussed in his blogpost on structured prediction and RL – you may see it as a one-state MDP.

What we can agree on is that in NMT we’re dealing with large and structured action spaces, where actions are discrete and rewards are sparse (and most of the time delayed) and potentially noisy. This calls for algorithms that are particularly suited for those conditions, but neither REINFORCE nor AC addresses these issues in particular.

In fact, training NMT from scratch with pure RL objectives, i.e. cold-start RL, has so far not succeeded (despite Xia et al. 2016’s optimism).

RL to the rescue?

RL can improve NMT because it fixes problems of our standard objective.

What’s wrong with MLE training for NMT? Ranzato et al. 2016 elaborated on this when proposing the MIXER algorithm that mixes policy gradient-style updates with MLE. They identify the following problems:

Exposure bias: During training reference targets are fed to the model (=teacher forcing), while during inference the model has to produce outputs based on its own previous outputs.
Token-level objective (aka “loss-evaluation mismatch” in Wiseman and Rush 2016): In standard autoregressive NMT models, the sequence-level log-likelihood is decomposed as a sum over token-level log-likelihoods. Training hence optimizes for finding the next perfect output token given the previous perfect tokens. During inference, however, we’re measuring the model’s quality with metrics like BLEU that evaluate whole sequences of outputs.

Algorithms like scheduled sampling (Bengio et al. 2015), DAgger (Ross et al. 2011) and DAD (Venkatraman et al. 2015) have been designed to reduce the exposure bias by gradually exposing the model to its own outputs during training (imitation learning).
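The core mechanism of scheduled sampling can be sketched like this: at every position, the history fed to the model contains either the reference token or the model’s own prediction. The sketch is simplified and assumes a fixed probability `gold_prob` instead of a decay schedule; `predict_next` is a hypothetical stand-in for the decoder.

```python
import random

def decode_with_scheduled_sampling(reference, predict_next, gold_prob, rng):
    """Return the history fed to the model and the model's predictions."""
    history = []
    predictions = []
    for gold_token in reference:
        predicted = predict_next(history)
        predictions.append(predicted)
        # With probability gold_prob feed the reference token (teacher forcing),
        # otherwise feed the model's own prediction.
        use_gold = rng.random() < gold_prob
        history.append(gold_token if use_gold else predicted)
    return history, predictions

def toy_predict_next(history):
    # Hypothetical decoder: predicts a placeholder token based on the position.
    return "tok%d" % len(history)

reference = ["a", "b", "c"]
rng = random.Random(0)
teacher_forced, _ = decode_with_scheduled_sampling(reference, toy_predict_next, 1.0, rng)
free_running, preds = decode_with_scheduled_sampling(reference, toy_predict_next, 0.0, rng)
```

With `gold_prob=1.0` this is plain teacher forcing, with `gold_prob=0.0` the model only ever sees its own outputs, as it does during inference.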

The same effect is obtained when including some policy gradient in the training objective (e.g. in MIXER, MRT), since the gradient update is based on the log-likelihood of the model’s own output. It is directly optimized towards a sentence-level reward that is closer to the corpus BLEU we’re evaluating our models with. Furthermore, it can help with other non-differentiable objectives than the expected reward, e.g., for adversarial training (Wu et al. 2017, Yang et al. 2017). Or you might just use it to teach the NMT system what you actually want from it (beyond generating translations close to the reference), e.g., copying certain words of the input (Pham et al. 2018).

Large gains using RL have been reported under domain shift, i.e., gains over baseline models that are not fine-tuned on the evaluation domain (e.g., Kreutzer et al. 2017, Petrushkov et al. 2018) or when combined with classic objectives (e.g., Wu et al. 2016, Ranzato et al. 2016). The above discussed paper demonstrates that without these factors, expected improvements vanish.

Most commonly, RL is exploited as a first aid for obvious MLE problems, in a fully-supervised setting where references are available and rewards are simulated. Why not use (or at least compare against) other training strategies that may be better suited for NMT and fix the above problems equally, as proposed e.g., in Edunov et al. 2018b, Shen et al. 2016 and Norouzi et al. 2016?

Beyond supervised learning

The challenges in RL for NLP lie outside supervised learning.

So what about more realistic uses of RL, e.g., where rewards cannot simply be simulated, or reward signals are not given as well-defined functions, or not available in unlimited amounts? In NLP, the following scenarios are evident:

Gold standard structures may not be available because of the cost or the lack of expertise of human annotators. Weaker signals such as human judgments on the quality of output structures may be easier to obtain and may require less expertise. This is the case for example in semantic parsing (Lawrence et al. 2018) or in machine translation (Kreutzer et al. 2018b).
In genuinely interactive settings where a system directly interacts with a human, the human responses can be interpreted as a weak signal for how to further improve the system. A prime example is dialogue, where learning from human feedback has successfully been implemented to train systems e.g., for small-talk (Serban et al. 2017) and task-oriented dialogue (Su et al. 2016).
Systems that need to be heavily customized towards a user or domain. User preferences or ratings (that usually come for free) can be used to specifically adapt the system. In industrial settings, large-scale collections of feedback have been utilised in personalized news recommendation (Li et al. 2010) or e-commerce translation systems (Kreutzer et al. 2018a).

These scenarios bring challenges that can only partly be addressed by simulations and arise from the interaction with humans in real-life scenarios. The human factor entails several differences to the popular simulation scenarios of RL. Firstly, human rewards are not well-defined functions, but complex and inconsistent signals. Secondly, humans cannot be expected to provide feedback for unlimited amounts of outputs. Exciting challenges (“RL is hard”) like the collection of reliable feedback, building robustness against adversarial feedback, fair evaluation, and off-policy learning, are ahead of us!

So instead of asking the question “How to get high BLEU with RL-objectives?” let’s move to “How to learn from rewards with RL when we depend on them?”.

Languages: Thanks to Lei Li for translating this post into Chinese!

Acknowledgment: Thanks to Carolin Lawrence, Stefan Riezler and Jasmijn Bastings for their valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this bibfile.

Translating Middle Egyptian Hieroglyphs

2019-10-29T00:00:00+00:00

I want to welcome you with a warmhearted which means

‘Your heart may be unscathed’

in Middle Egyptian around 5000 years ago and can simply be interpreted as

‘Hello’ 😙

I hope you’re now also curious to read about the steps that we took to translate a lot of Middle Egyptian resources. So, enjoy and read on!

Once upon a time in Egypt…

For this post, I decided to put the disclaimer early on: We’re actually not translating hieroglyphs into a contemporary language by their visual appearance. The signs that I’ve welcomed you with are Middle Egyptian graphemes that developed from pictograms. They were used in Ancient Egypt from around 3200 BCE to the 5th century of our common era. So, it’s definitely one of the longest writing traditions out there. But these symbols, which you can observe on the remains of tomb walls and temples, were not used for letters, administrative or legal documents. For these occasions, the forms of cursive hieroglyphs, hieratic and demotic emerged instead, the latter to be considered an independent language. Coming back to my disclaimer from above: We do have multiple writing forms of the hieroglyphs (not demotic actually, as it is not exactly Middle Egyptian), but we have them in an encoded format. This encoding is based on the monumental research work of Sir Alan Gardiner:

Source	Data
Hieroglyph
Encoding	H6 G43 M17 M17 X1 N5 Z1 G17 D4 G17 H6 Z7 N5
Transcription	šw,yt m jri̯ m šwi
Part-of-speech tags	substantive verb verb preposition substantive
Word-level translation	shadow , not be as sun
Interpreted translation	`Shade, don't be as the blazing sun` (by Mark-Jan Nederhof)

So, we actually have some character/number combinations that map the hieroglyphs to a more manageable code. This originally comes from grouping the signs into semantic categories. Also, the table mirrors a further source: It’s called a transliteration from an Egyptologist’s point of view, but we actually treated it as a source of transcription. This circumscription of the hieroglyphs once was a method to publish Ancient Egyptian resources. But it also represents the translator’s interpretation of word/sentence boundaries and insertion of missing signs. Although vowels are not reflected within the hieroglyphs, the consonantal transliteration alphabet is expressive enough to consider it a kind of transcription. That’s amazing, as we can then apply machine translation methods from both AST (Automated Speech Translation) and NMT (Neural Machine Translation). For our corpus, which we thankfully received in cooperation with the Thesaurus Linguae Aegyptiae project, we also have access to part-of-speech tags. Sounds good, right? But the whole parallel corpus consisted of only around 30,000 pairs. So we dealt with a pretty tough low-resource scenario.

How to exploit all these resources?

We experimented with several techniques, called Backtranslation (NMT) and Pipeline models (AST), as viable opponents to our best player: Multi Task Learning. Here is a short introduction of the three guys:

Pipeline Model

This setup is really close to the human approach to translating Middle Egyptian texts: First train an Encoder/Decoder model that learns to translate hieroglyphs $\to$ transcription. In parallel we also train a model from transcription $\to$ translation. We can then translate any Egyptian text by first generating the transcription and afterwards using this output to generate translations - just like Egyptologists do it.
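As a rough sketch, the pipeline is just the composition of two models, where the second one only ever sees the output of the first. The two "models" below are toy whole-sentence lookups built from the example in the table above; they stand in for trained encoder/decoder systems, and the function name is made up.

```python
def make_pipeline(first_stage, second_stage):
    """Chain two translation steps: source -> intermediate -> target."""
    def translate(source):
        intermediate = first_stage(source)
        return second_stage(intermediate)
    return translate

ENCODING = "H6 G43 M17 M17 X1 N5 Z1 G17 D4 G17 H6 Z7 N5"
TRANSCRIPTION = "šw,yt m jri̯ m šwi"
TRANSLATION = "Shade, don't be as the blazing sun"

# Toy stand-ins for the two trained models (hypothetical lookups).
hieroglyphs_to_transcription = {ENCODING: TRANSCRIPTION}.get
transcription_to_translation = {TRANSCRIPTION: TRANSLATION}.get

pipeline = make_pipeline(hieroglyphs_to_transcription, transcription_to_translation)
result = pipeline(ENCODING)
```

The design has an obvious weak spot: any error in the generated transcription is passed on to the second model, which was trained on clean transcriptions.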

Backtranslation

Backtranslation is a famous tool when dealing with low resources: One first trains a backward model on the available data that translates from the target language to the source language. After that, one can take any additional (best: in-domain) target language corpus and backtranslate it into the source language - and voilà: There’s our additional data that we can use to train our main system! In our case, we were faced with a very special situation: The database that our corpus was extracted from is filled in reverse order: The TLA first input the translation and other sources and have not yet finished integrating the Gardiner encoding. This means that on the one hand, we were missing around 60,000 Egyptian encodings, but on the other hand could use the available target sources to actually backtranslate them. We did that and added parts of these “synthetic” sentences bit by bit to the parallel corpus to evaluate whether that helped our system to learn.

Multi Task Learning

Last but not least, we implemented a multi-tasking schedule. This technique originates from human learning: When tackling a difficult problem, it might help to

first learn a simpler problem (e.g. creating word boundaries before translating)
deal with a related problem that helps to generalize the solution of the main problem (e.g. POS tagging before translating)

And this is exactly how we could make the most of our sparse data! So, we implemented a schedule within our training that switched from translating hieroglyphs $\to$ transcription to, let’s say, transcription $\to$ hieroglyphs or hieroglyphs $\to$ pos-tags. The key is that the encoder/decoder was shared during that process and learned to encode from/decode to multiple resources!
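To give an idea of what such a schedule might look like, here is a minimal sketch that cycles through the tasks round-robin, one task per training step. This is an assumption for illustration; the schedule actually used in the paper may differ (e.g. in the order or the proportion of tasks).

```python
from itertools import cycle, islice

def task_schedule(tasks, num_steps):
    """Round-robin over tasks: step i trains on tasks[i % len(tasks)]."""
    return list(islice(cycle(tasks), num_steps))

# (input, output) pairs that all share one encoder/decoder.
tasks = [
    ("hieroglyphs", "transcription"),
    ("transcription", "hieroglyphs"),
    ("hieroglyphs", "pos-tags"),
]
schedule = task_schedule(tasks, 5)
```

In a real training loop, each step would draw a batch for the scheduled task and update the shared parameters.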

And the winner is….

How did we do? Which system dealt best with the extreme data sparsity? For details, I’ll just point to the paper (Wiesenbach & Riezler, 2019) 😋 But as a spoiler, the one-2-many MTL system took first place by learning from an additional 30% of transcription and POS tags. The Pipeline and Backtranslation models both fell short as they just couldn’t leverage the little amount of data. I hope you had a good read and learned some facts about dealing with ancient languages and some techniques for low-resource use cases.

Acknowledgment: Thanks to Stefan Riezler for his valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of his affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, cite the paper instead:

Philipp Wiesenbach and Stefan Riezler

Multi-Task Modeling of Phonographic Languages: Translating Middle Egyptian Hieroglyphs

Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2019

pdf | bib

@article{wiesenbach19,
  title = {Multi-Task Modeling of Phonographic Languages: Translating Middle Egyptian Hieroglyphs},
  author = {Wiesenbach, Philipp and Riezler, Stefan},
  journal = {Proceedings of the International Workshop on Spoken Language Translation},
  journal-abbrev = {IWSLT},
  year = {2019},
  url = {https://www.cl.uni-heidelberg.de/statnlpgroup/publications/IWSLT2019_v2.pdf}
}

Joey NMT - A Minimalist NMT Toolkit for Novices

2019-10-23T00:00:00+00:00

Proposing Joey NMT.

Another neural machine translation (NMT) toolkit like all the others? No, this one is for you - students, novices, beginners, newbies, and for the lovers of quick prototyping and minimalism. Joey NMT matches the quality of standard toolkits such as Sockeye and OpenNMT with only one fifth of the code!

NMT toolkits have been popping up constantly over the last five years, just as deep learning frameworks keep evolving. As a newcomer it is difficult to find the best path through the NMT toolkit jungle. Guiding features are often 1) popularity, 2) the deep learning framework that the toolkit builds on, 3) the machine translation quality, 4) documentation, 5) speed, and 6) community support. Our goal is to make the start easier with a clean code base, solid documentation and a focus on the important implementation details.

Please find the code on GitHub: joeynmt.

Why Joey NMT?

If you’re working on a thesis on NMT, or an internship project, or you quickly want to implement a research idea, you don’t want to get frustrated by spending days of reading through huge code bases, trying to follow inheritance hierarchies and fill the gaps in the (outdated?) documentation, and updating your fork every day to try to keep up with the most recent changes. So let’s look at what Joey NMT has to offer - I’ll give you five reasons to give Joey NMT a try.

Joey NMT builds on PyTorch, a beginner-friendly Deep Learning library in Python that has lots of open-source tutorials and examples online.
It matches benchmark performance of large-scale industry-led projects like Sockeye for RNN-based and Transformer models. That means you can rely on good baselines and quickly evaluate your new ideas. Find the detailed results here or in the paper.

WMT 17 benchmark results.
Its readability was empirically evaluated in a user study with expert and novice NMT users. Novices were able to quickly understand the code base without a teacher, just a little slower than the experts (see “User Study” in our EMNLP paper). The cleanliness of the code base is ensured with the help of Pylint checks. We build a flat hierarchy with at most one level of inheritance, slightly preferring sequential over hierarchical code solutions.

Example question from the quiz that participants of the study used to explore Joey NMT code.
It has an extensive documentation: docstrings, in-line comments (including tensor shapes!), FAQs and a tutorial, ranging from simple use cases to instructions on how to extend the model, tune and visualize the progress. In fact, the comment-to-code ratio is almost twice as high as in other frameworks. So you’ll actually be able to read natural language, not just code.
Its purpose is to be stable and minimalist rather than implementing the latest hottest feature. No surprises with API changes overnight.

And if that’s not enough, here are two more bonus points:

We released pre-trained benchmark models for large-scale tasks (WMT17 en-de/lv) but also on low-resource South-African languages (Autshumato corpus as prepared in the Uxhumana project, en-af/nso/tn/ts/zu). No need for you to re-train these models. You can use them off-the-shelf for translations, distillations, and fine-tuning.
There’s a growing community of people (and accordingly GitHub forks) who use and extend it in different directions, e.g. for learning with various levels of feedback or hieroglyph translation. That means you can take inspiration from other people’s integration solutions. Most prominently, Joey NMT is also used to train NMT models for African languages in the Masakhane project with the goal to put Africa on the NMT map.

What’s in it?

When developing Joey NMT we set the minimalist goal to achieve at least 80% quality compared to SOTA, with 20% of the code. As a result, Joey NMT now provides the following features (aka the bare necessities of NMT):

Recurrent Encoder-Decoder with GRUs or LSTMs
Transformer Encoder-Decoder
Attention Types: MLP, Dot, Multi-Head, Bilinear
Word-, BPE- and character-based input handling
BLEU, ChrF evaluation
Beam search with length penalty and greedy decoding
Customizable initialization
Attention visualization
Learning curve plotting

The EMNLP paper (Kreutzer et al., 2019) describes the details of the RNN and Transformer implementations, and also provides a comparison of features across toolkits (very last page of the Appendix).

What’s next?

How to get started?

Check out the tutorial (YouTube screencast) for a quick walk-through for synthetic data or the Masakhane notebook that describes every step from data preprocessing to model evaluation.
Missing something?

Talk to us on Gitter or raise an issue on GitHub.
How to get involved in development?

If you’d like to contribute, make a pull request for your Joey NMT extensions or look at open issues to see where your help would be welcome.

Acknowledgment: Thanks to all students and colleagues from ICL Heidelberg and the Masakhane project who helped to improve the code quality. And thanks to Stefan Riezler, Mayumi Ohta and Jasmijn Bastings for their feedback on this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, cite the Joey NMT paper instead:

Julia Kreutzer, Jasmijn Bastings and Stefan Riezler

Joey NMT: A Minimalist NMT Toolkit for Novices

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Hong Kong, China, 2019

pdf | code | bib

@inproceedings{joey2019,
  author = {Kreutzer, Julia and Bastings, Jasmijn and Riezler, Stefan},
  title = {Joey {NMT}: A Minimalist {NMT} Toolkit for Novices},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing {(EMNLP-IJCNLP)}: System Demonstrations},
  year = {2019},
  address = {Hong Kong, China},
  url = {https://www.aclweb.org/anthology/D19-3019}
}

Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview

2019-08-15T00:00:00+00:00

“We all need people who will give us feedback. That’s how we improve.” - Bill Gates, TED Talks Education, May 2013

Motivation

We all know that supervised data is expensive to obtain. So let’s ask the following question: What if we learn from feedback given to model outputs instead?

Next to reducing the requirement for supervised data, learning from feedback also has several other advantages:

Even if supervised data is given, we want to also discover alternative good outputs.
With feedback given to model outputs, we can improve over time.
It is possible to personalise a system to a specific use case or user.

For these reasons, I explored how to learn from feedback for sequence-to-sequence tasks in NLP in my PhD thesis.

The scenario I assume in my thesis can be summarized with the following picture:

A pre-trained model receives an input for which it produces one or several outputs. An output is grounded in a given external world which assigns some feedback to it. The feedback is then used to update the pre-trained model.

Overview

While exploring how to learn from feedback, there are three different aspects we consider in the thesis. First, we have a final application in mind: we want to build a natural language interface to the geographical database OpenStreetMap (OSM). Second, we consider two different approaches to learn from feedback, response-based on-policy learning and counterfactual off-policy learning. Third, both approaches are applied to two different tasks, semantic parsing for question answering and machine translation.

Overall, the thesis has three parts, which we now look at in turn.

In Part 1 we set up the application of building a natural language interface to OSM. Parts 2 and 3 each look at one approach to learning from feedback. In both cases, the approach is applied to both tasks, semantic parsing for question answering and machine translation. Finally, we conclude by drawing a direct comparison between both approaches.

Read on for the details or jump to the conclusion.

Part 1: A Natural Language Interface to OSM

Question-Answering Task

OpenStreetMap (OSM) is a geographical database of points of interest (POIs) in the world that is populated by volunteers. Currently, it can only be queried with straightforward string matching methods. But to find POIs with more complex relationships, such as “where is the hotel closest to the main station?”, it is necessary to issue a complicated database query. Because everyday users do not know how to issue such complex queries, we build a natural language interface to OSM. Here, users can ask natural language questions that are then automatically mapped to database queries. The execution of a query against the OSM database yields the corresponding answer. To achieve the automatic mapping, we built a semantic parser that learns to transform a natural language question into a database query, in this context also called a (semantic) parse.

We first collected a manually annotated corpus, NLmaps, of 2,380 question-parse pairs. This corpus was later automatically extended and NLmaps v2 contains 28,609 question-parse pairs.

Semantic Parsers

Using either corpus allows us to train a semantic parser. For NLmaps v2, we found the best parser to be an encoder-decoder neural network with attention (based on Nematus). Additionally, named entities are handled separately in three steps. First, prior to the semantic parsing step, another neural network identifies named entities. Second, these named entities are replaced with placeholders for the semantic parsing step. Finally, the original named entities are added back into the placeholders of the parse. This led to a parser with an answer-level F1 score of about 90%.
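
The placeholder handling can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the entity strings are assumed to be given (in the thesis a neural network identifies them), and the function names, the `<NE0>`-style placeholder format and the toy parse string are all hypothetical.

```python
def mask_entities(question, entities):
    """Replace each named entity in the question with a numbered placeholder."""
    mapping = {}
    for i, entity in enumerate(entities):
        placeholder = f"<NE{i}>"  # hypothetical placeholder format
        question = question.replace(entity, placeholder, 1)
        mapping[placeholder] = entity
    return question, mapping


def unmask_entities(parse, mapping):
    """Put the original named entities back into the placeholders of the parse."""
    for placeholder, entity in mapping.items():
        parse = parse.replace(placeholder, entity)
    return parse


masked, mapping = mask_entities("Where are hotels in Heidelberg?", ["Heidelberg"])
# A made-up parser output over the masked question (not real NLmaps syntax).
toy_parse = "query(area(name(<NE0>)), poi(tourism(hotel)))"
restored = unmask_entities(toy_parse, mapping)
```

The parser itself only ever sees the placeholder, so it does not have to learn to copy rare entity names.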

With a semantic parser now available, we built a graphical interface for users to access the natural language interface to OSM. After entering a question, it is sent to the semantic parser, which produces a database query. The parse is then executed against the database and both a textual and a graphical answer are displayed for the user. For example, in the picture below a user asked about cuisines in Heidelberg. A list of the various cuisines is displayed and clicking on a cuisine opens pop-up information boxes on relevant markers on the map below.

If you want, try out your own questions!

Part 2: Response-Based On-Policy Learning

We now turn to the first approach to learn from feedback, response-based on-policy learning. The idea of response-based on-policy learning is to ground a model $\pi_w$ in a downstream task for which gold targets are available. A great advantage of this approach is that feedback can be obtained for arbitrarily many outputs.

Concretely, we employ a ramp loss:

$\mathcal{L}_{\mathrm{RAMP}} = - \left( \frac{1}{m} \sum_{t=1}^{m} \pi_w(y_t^+ \vert x_t) - \frac{1}{m} \sum_{t=1}^{m} \pi_w(y_t^- \vert x_t)\right)$ .

In a ramp loss, a hope sequence $y^+$ is encouraged, while a fear sequence $y^-$ is discouraged. The specific instantiations are deferred to concrete tasks. But in general, a hope sequence has a high probability under the current model $\pi_w$ while receiving a high feedback score $\delta$ . In contrast, a fear sequence also obtains a high probability under the current model $\pi_w$ but receives a low feedback score $\delta$ .
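
To make the ramp loss concrete, here is a small sketch over toy candidate lists. The selection rule is an assumption made for illustration (hope: the most probable candidate with feedback 1, fear: the most probable candidate with feedback 0); the instantiations actually used in the thesis depend on the task, and all names and numbers below are made up.

```python
def select_hope_and_fear(candidates):
    """Pick hope and fear probabilities from (probability, feedback) pairs.

    Assumption of this sketch: hope is the most probable candidate with
    feedback 1, fear the most probable candidate with feedback 0.
    Returns None for a role if no candidate qualifies.
    """
    good = [p for p, delta in candidates if delta == 1]
    bad = [p for p, delta in candidates if delta == 0]
    hope = max(good) if good else None
    fear = max(bad) if bad else None
    return hope, fear


def ramp_loss(batch):
    """Ramp loss over a minibatch: minus (mean hope prob. - mean fear prob.)."""
    pairs = [select_hope_and_fear(c) for c in batch]
    # Inputs without both a hope and a fear are skipped (a choice of this sketch).
    pairs = [(h, f) for h, f in pairs if h is not None and f is not None]
    if not pairs:
        return 0.0
    m = len(pairs)
    hope_term = sum(h for h, _ in pairs) / m
    fear_term = sum(f for _, f in pairs) / m
    return -(hope_term - fear_term)


batch = [
    [(0.5, 0), (0.3, 1), (0.2, 0)],    # the model prefers a wrong output
    [(0.6, 1), (0.25, 0), (0.15, 1)],  # the model prefers a correct output
]
loss = ramp_loss(batch)
```

Minimizing this quantity pushes probability towards the hope outputs and away from the fear outputs.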

Multilingual Semantic Parsing: NLmaps

For this task, we assume a semantic parser can transform English questions into OSM queries, but a user wants to ask questions in German. Thus, we first employ a machine translation system to translate the question from German into English. The goal is to adjust the machine translation system to work well in conjunction with the semantic parser. We use the ramp loss defined above and instantiate $\delta$ to be 1 if a machine-translated question ultimately leads to the correct answer and 0 otherwise. For an overview of the setup, see the picture below.

By using the feedback signal of the downstream semantic parsing task, we can improve a linear-model machine translation system to work better in conjunction with the semantic parser. The adjusted system achieves a higher answer-level F1 score by about 8 percentage points compared to the baseline system. This is the first example that demonstrates the effectiveness of grounding a model in a downstream task.

Question-Answering: NLmaps v2

For many question-answering tasks, it is easier to obtain gold answers rather than gold parses. Thus, it is possible to ground semantic parsers in gold answers and treat the parses as hidden. In this scenario, we can again employ the above defined ramp loss, where a semantic parse receives a feedback of $\delta=1$ if the parse leads to a correct answer and $\delta=0$ otherwise.

On this task, we employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. This leads to a new loss function, called Ramp+T, that performs better (for more information, see Chapter 6 of the thesis).

For our experiment, we assume an initial model has been trained on 2k supervised question-parse pairs. For the remainder of the training data, only gold answers, but not gold parses, are available. Grounding the semantic parser in the gold answers with our new loss function, Ramp+T, allows us to outperform the baseline model by over 12 percentage points in answer-level F1 score.

We have now successfully applied response-based on-policy learning to two tasks. However, this approach ultimately requires gold targets of a downstream task, which can still be too expensive to obtain. This is the case in the OSM domain: for the question “How many hotels are there in Paris?”, for example, we cannot expect a person to count all 951 hotels in a reasonable amount of time or without error. Consequently, we next look at an approach that requires no gold targets at all.

Part 3: Counterfactual Off-Policy Learning

In the second approach to learn from feedback, counterfactual off-policy learning, we assume that a model is deployed. Users interact with the model and corresponding feedback is logged, hence the deployed model is also called the logging model. Once enough feedback is collected, the collected log can be used to improve either the logging model or any other model. With this setup, we can learn from feedback and do not require any direct or indirect gold targets. For a graphical overview see the picture below.

We update the model offline for several reasons:

Safety: a deployed model that is updated could degenerate without notice, leading to a bad user experience.
Hyperparameters: offline it is possible to do hyperparameter testing.
Validation: the new model can be validated on a test set before it is deployed.

While offline learning provides us with several crucial benefits, it is more challenging, because:

Bandit setup: feedback is only given to one output.
Bias: the logged output is biased towards the choice made by the logging policy.

We refer to the approach as counterfactual because we can ask the following counterfactual question: How would another model have performed if it had been in control during logging?

To employ this approach to learn from feedback, we need to collect a log $D=\{(x_t,y_t,\delta_t)\}_{t=1}^n$ with

$x_t$ : input
$y_t$ : output from logging model $\mu$
$\delta_t$ : feedback received from user

Based on the log, counterfactual estimators can be defined to estimate the performance of another model $\pi_w$ . The model $\pi_w$ can then be updated via stochastic gradient ascent on this estimate, i.e. $w \leftarrow w + \eta \nabla_w \mathcal{V}(\pi_w)$ , where $\mathcal{V}$ is the chosen estimator and $\eta$ is a suitably set learning rate.

In previous literature, it is assumed that outputs are sampled stochastically from the logging model. This leads to the Inverse Propensity Scoring (IPS) estimator, which can correct the bias introduced by the logging model via importance sampling:

$\mathcal{V}_{\mathrm{IPS}}(\pi_w) = \frac{1}{n} \sum_{t=1}^n \delta_t \frac{\pi_w(y_t \vert x_t)}{\mu(y_t \vert x_t)}$ .
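
As a quick illustration, the IPS formula can be computed directly from a log. The sketch below assumes each log entry also stores the logging probability $\mu(y_t \vert x_t)$; the function name, the log layout and the toy numbers are made up for this example.

```python
def ips_estimate(log, target_prob):
    """Inverse propensity scoring estimate for a target model.

    log: list of (x, y, delta, mu_prob) with mu_prob = mu(y | x) > 0.
    target_prob: function (x, y) -> pi_w(y | x).
    """
    total = 0.0
    for x, y, delta, mu_prob in log:
        # Feedback is reweighted by how much more (or less) likely the
        # target model is to produce the logged output than the logger was.
        total += delta * target_prob(x, y) / mu_prob
    return total / len(log)


toy_log = [
    ("x1", "a", 1.0, 0.5),
    ("x2", "b", 0.0, 0.25),
    ("x3", "c", 1.0, 0.8),
]
toy_policy = {("x1", "a"): 0.75, ("x2", "b"): 0.5, ("x3", "c"): 0.4}
estimate = ips_estimate(toy_log, lambda x, y: toy_policy[(x, y)])
```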

However, sampling is dangerous because we risk showing inferior outputs to a user, which would lead to a bad user experience. In machine translation, for example, if one samples from the model’s output distribution, there is a high risk that the sampled output is not a correct translation. For this reason, we want to always select the most likely output. This leads to deterministic logging where $\mu(y_t \vert x_t)=1$ for all instances. Consequently, the importance sampling correction is disabled. We refer to this estimator as Deterministic Propensity Matching (DPM):

$\mathcal{V}_{\mathrm{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^n \delta_t \pi_w(y_t \vert x_t)$ .
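
A toy computation of the DPM formula shows what dropping the propensity term means in practice. The log layout, names and numbers below are made up for illustration.

```python
def dpm_estimate(log, target_prob):
    """DPM: the IPS formula with mu(y | x) fixed to 1 for every logged output.

    log: list of (x, y, delta); target_prob: function (x, y) -> pi_w(y | x).
    """
    return sum(delta * target_prob(x, y) for x, y, delta in log) / len(log)


toy_log = [("x1", "a", 1.0), ("x2", "b", 0.2), ("x3", "c", 0.0)]

# A model that mostly backs the well-rated output ...
cautious = {("x1", "a"): 0.9, ("x2", "b"): 0.1, ("x3", "c"): 0.1}
# ... and one that assigns probability 1 to every logged output.
greedy = {("x1", "a"): 1.0, ("x2", "b"): 1.0, ("x3", "c"): 1.0}

v_cautious = dpm_estimate(toy_log, lambda x, y: cautious[(x, y)])
v_greedy = dpm_estimate(toy_log, lambda x, y: greedy[(x, y)])
```

Because the feedback values are non-negative, raising the probability of any logged output can never lower the estimate, so `v_greedy` comes out higher than `v_cautious` even though the second model also boosts poorly rated outputs.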

We would now like to find out if the deterministic DPM estimator can be used instead of the stochastic IPS estimator for sequence-to-sequence tasks in NLP.

Machine Translation

To investigate whether DPM is feasible in comparison to IPS, we set up a machine translation experiment with simulated feedback. An out-of-domain MT system translates in-domain data. To simulate feedback, we use the available gold references. This allows us to create a stochastic and a deterministic log that share the same feedback signal.

Both IPS and DPM suffer from high variance and can exhibit degenerate behaviour (see Chapter 7.2 in the thesis). To combat this, we add two control variates to each estimator, a multiplicative and an additive one (for an overview of control variates, see the great slides by Matthew W. Hoffman). This leads to the stochastic ĉDoubly Robust (ĉDR) and the deterministic ĉDoubly Controlled (ĉDC) estimator.
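
To give a feeling for the two kinds of control variate, here is a deliberately simplified sketch for the deterministic case. It is not the ĉDR or ĉDC estimator from the thesis: the multiplicative variate is reduced to dividing by the sum of model probabilities (self-normalization), and the additive variate to a constant feedback estimate that is subtracted from the logged feedback and added back once. All names and numbers are illustrative assumptions.

```python
def dpm(deltas, probs):
    """Plain deterministic estimate: mean of feedback times model probability."""
    return sum(d * p for d, p in zip(deltas, probs)) / len(deltas)


def dpm_reweighted(deltas, probs):
    """Multiplicative control variate: normalize by the sum of probabilities."""
    return sum(d * p for d, p in zip(deltas, probs)) / sum(probs)


def dpm_with_baseline(deltas, probs, baseline):
    """Additive control variate with a constant feedback estimate `baseline`.

    The baseline is subtracted from each logged feedback value and added
    back once, so only feedback above the baseline rewards more probability.
    """
    n = len(deltas)
    return sum((d - baseline) * p for d, p in zip(deltas, probs)) / n + baseline


deltas = [1.0, 0.2, 0.0]
before = [0.9, 0.1, 0.1]  # probabilities under a reasonable model
after = [1.0, 1.0, 1.0]   # all logged outputs pushed to probability 1
```

On these toy numbers the plain estimate prefers `after`, while both controlled variants prefer `before`: blindly raising the probability of every logged output no longer pays off.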

We run experiments on two separate datasets and in both cases the deterministic estimator performs as well as the stochastic one. From this, we conclude that deterministic logging is viable for sequence-to-sequence NLP tasks because there is enough implicit exploration at the word level (see Chapter 7.3.4 in the thesis).

However, we still need to show that counterfactual off-policy learning is possible for sequence-to-sequence NLP tasks when the feedback is obtained from real human users. We tackle this in the next section.

Question-Answering: NLmaps v2

We noted earlier that it is difficult for some question-answering domains to obtain gold answers, e.g. in the case of the OSM domain where we, for example, can’t expect a human to count 951 hotels. As the OSM query language is relatively unknown, it is also difficult to obtain gold parses. Thus, counterfactual off-policy learning, where no gold answers are required, is particularly suitable for the OSM domain.

However, given for example the question “How many hotels are there in Paris?” and a corresponding answer, e.g. “951” or “1,003”, a human still cannot judge whether “951” or “1,003” are correct or not. To solve this issue, we instead propose to make the underlying parse human understandable. We do this by automatically converting the parse into a set of statements that can easily be judged as right or wrong. You can see what this looks like for our example in the following picture:

Once the form is filled out, we can map the individual statements back to the tokens in the parse that produced them. With this approach we collected feedback for 1k question-parse pairs from 9 humans.

For this task, we again employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. This suits our situation particularly well because the feedback form already collects feedback at the token level. This leads to a new objective, called DPM+T.

The DPM+T objective does not employ a control variate, but we would like to use one to reduce variance. The multiplicative control variate we used previously, reweighting (Swaminathan and Joachims, 2015), is not applicable to stochastic minibatch learning. To make it applicable, we modify this control variate, leading to a new control variate that we refer to as One-Step-Late reweighting (OSL). Together with the previous new objective, this leads to the combined objective, DPM+T+OSL (for more information, see Chapter 8 of the thesis). DPM+T+OSL is the best objective both for learning from the 1k human feedback instances and for learning from a larger, but simulated, log of 22k feedback instances.

Comparison of both learning approaches

Because we employ the same NLmaps task and the same neural network architecture for both approaches to learn from feedback, we can directly compare the two approaches.

Unsurprisingly, response-based learning significantly outperforms counterfactual learning because it has a better learning signal available. Because response-based learning has a downstream gold target at hand, it can obtain feedback for arbitrarily many model outputs. Counterfactual learning instead only has access to one model output and its feedback. Furthermore, that model output is biased by the logging policy.

Ultimately, the choice between response-based and counterfactual learning reduces to how expensive it is to obtain gold targets. For example, for the OSM domain, it is impractical to obtain gold parses as well as gold answers because the parses can only be written by a handful of people and the answers are too cumbersome for humans to derive. In such a situation, obtaining feedback on model outputs from human users is a viable alternative.

If the base model is good enough, this feedback can directly be collected while real users are interacting with the system. Otherwise, another option would be to recruit human workers to provide the needed feedback.

So in conclusion: counterfactual learning should be chosen if gold targets are impossible, too time-consuming or too expensive to obtain, while feedback for model outputs can be collected easily. Otherwise, response-based learning is the better approach because the available gold targets offer a stronger learning signal. For an overview of this, also see the following diagram:

Conclusion

It is a good idea to explore how to learn from feedback given to model outputs for several reasons, the primary one being that the collection of direct gold targets might be too expensive.

In my thesis, I explored two separate approaches to learn from feedback, response-based and counterfactual learning. Response-based learning assumes that indirect gold targets are available. Counterfactual learning does not require gold targets and instead saves feedback given by humans interacting with a deployed system in a log.

If (indirect) gold targets can be obtained, response-based learning is the more promising approach because the gold targets offer a stronger learning signal. However, for situations where it is not possible to collect direct or indirect gold targets, counterfactual learning offers a viable alternative.

Next to exploring how to learn from feedback, it was important to me during my PhD project to keep a concrete user application in mind. To this end, I developed a natural language interface to OpenStreetMap (OSM).

My PhD project was a long, but very rewarding journey. I learnt so much and got to join a great NLP community. Special thanks go to my supervisor, Stefan Riezler, who always encouraged my ideas and guided me to the path that led to my thesis. I also want to thank all my colleagues who were always willing to listen and offer suggestions.

If you enjoyed this post and want to discuss anything further, feel free to reach out to me via e-mail or twitter.

More information can be found in the thesis.

Acknowledgment: Thanks to Stefan Riezler and Mayumi Ohta for their valuable feedback to improve this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

If you want to cite this blogpost, use this bib:

@misc{Lawrence:19,
    author = {Lawrence, Carolin},
    title = {Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview},
    journal = {StatNLP HD Blog},
    type = {Blog},
    number = {August},
    year = {2019},
    howpublished = {\url{https://www.cl.uni-heidelberg.de/statnlpgroup/blog/lff/}}
}

The Real Challenge of Real-World Reinforcement Learning: The Human Factor

2019-07-26T00:00:00+00:00

The full potential of reinforcement learning requires reinforcement learning agents to be embedded into the flow of real-world experience, where they act, explore, and learn in our world, not just in their worlds. (Sutton & Barto (2018). Reinforcement Learning. An Introduction. 2nd edition)

Recent well-recognized research has shown that artificial intelligence agents can achieve human-like or even superhuman performance in playing Atari games (Mnih et al. 2015), or the game of Go (Silver et al. 2016), without human supervision (Silver et al. 2017), but instead using reinforcement learning techniques for many rounds of self-play. This is a huge achievement in artificial intelligence research, opening the doors for applications where supervised learning is (too) costly, and with ramifications for many other application areas beyond gaming. The question arises how to transfer the superhuman achievements of RL agents under clean room conditions like gaming (where reward signals are well-defined and abundant) to real-world environments with all their shortcomings, first and foremost, the shortcomings of human teachers (who obviously would not pass the Turing test, as indicated in the comic below).

The human factor in real-world RL for natural language processing

Let us have a look at human learning scenarios, for example, natural language translation: A human student of translation and interpretation studies has to learn how to produce correct translations from a mix of feedback types. The human teacher will provide supervision signals in form of gold standard translations in some cases. However, in most cases the student has to learn from weaker teacher feedback that signals how well the student accomplished the task, without knowing what would have happened if the student had produced a different translation, nor what the correct translation should look like. In addition, the best students will become like teachers in that they acquire a repertoire of strategies to self-regulate their learning process (Hattie and Timperley 2007).

Now, if our goal is to build an artificial intelligence agent that learns to translate like a human student, in interaction with a professional human translator acting as the teacher, we see the same pattern of a cost-effectiveness tradeoff: The human translator will not want to provide a supervision signal in form of a correct translation as feedback to every translation produced by the agent, even if this signal is the most informative. Rather, in some cases weaker feedback signals on the quality of the system output, or on parts of it, are a more efficient way of student-teacher interaction. Another scenario is users of online translation systems: They act as consumers - sometimes they might give a feedback signal, but rarely a fully correct translation.

We also see a similar pattern in the quality of the teacher’s feedback signal when training a human and when training an agent: The human teacher of the human translation student and the professional translator acting as a human teacher of the artificial intelligence agent are both human: Their feedback signals can be ambiguous, misdirected, sparse, in short - only human (see the comic above). This is in stark contrast to the scenarios in which the success stories of RL have been written - gaming. In these environments reward signals are unambiguous, accurate, and plentiful. One might say that the RL agents playing games against humans received an unfair advantage from an artificial environment that suits their capabilities. However, in order to replicate these success stories for RL in scenarios with learning from human feedback, we should not belittle these successes, but learn from them: The goal should be to give the RL agents that learn from human feedback any possible advantage to succeed in this difficult learning scenario. For this we have to better understand what the real challenges of learning from human feedback consist of.

Disclaimer

In contrast to previous work on learning from human reinforcement signals (see, for example, Knox and Stone, Christiano et al. 2017, Leike et al. 2018), our scenario is not one where human knowledge is used to reduce the sample complexity and thus to speed up the learning process of the system, but one where no reward signals other than human feedback are available for interactive learning. This scenario applies to many personalization scenarios where a system that is pre-trained in a supervised fashion is adapted and improved in an interactive learning setup from feedback of the human user. Examples are online advertising or machine translation, which we will focus on here.

Recent work (Dulac-Arnold et al. 2019) has recognized that the poorly defined realities of real-world systems are hampering the progress of real-world reinforcement learning. They address, amongst others, issues such as off-line learning, limited exploration, high-dimensional action spaces, or unspecified reward functions. These challenges are important in RL for control systems or robots grounded in the physical world. However, they severely underestimate the human factor in interactive learning. We will use their paper as a foil to address several recognized challenges in real-world RL.

Counterfactual learning under deterministic logging

One of the issues addressed in Dulac-Arnold et al. 2019 is the need for off-line or off-policy RL in applications where systems cannot be updated online. Online learning is unrealistic in commercial settings due to latency requirements and the desire for offline testing of system updates before deployment. A natural solution would be to exploit counterfactual learning that reuses logged interaction data where the predictions have been made by a historic system different from the target system.

However, both online learning and offline learning from logged data are plagued by the problem that exploration is prohibitive in commercial systems since it means showing inferior outputs to users. This effectively results in deterministic logging policies that lack explicit exploration, making an application of standard off-policy methods questionable. For example, techniques such as inverse propensity scoring (Rosenbaum and Rubin 1983), doubly-robust estimation (Dudik et al. 2011), or weighted importance sampling (Precup et al. 2000, Jiang and Li 2016, Thomas and Brunskill 2016) all rely on sufficient exploration of the output space by the logging system as a prerequisite for counterfactual learning. In fact, Langford et al. 2008 and Strehl et al. 2010 even give impossibility results for exploration-free counterfactual learning.

Clearly, standard off-policy learning does not apply when commercial systems interact safely, i.e., deterministically with human users!

So what to do? One solution is to hope for implicit exploration due to input or context variability. This has been observed for the case of online advertising (Chapelle and Li 2012) and investigated theoretically (Bastani et al. 2017). However, natural exploration is something inherent in the data, not something machine learning can optimize for.

Another solution is to consider concrete cases of degenerate behavior in estimation from deterministically logged data, and find solutions that might circumvent the impossibility theorems. One such degenerate behavior consists in the fact that the empirical reward over the data log can be maximized by setting the probability of all logged data to 1. However, it is clearly undesirable to increase the probability of low reward examples (Swaminathan and Joachims 2015, Lawrence et al. 2017a, Lawrence et al. 2017b). A solution to the problem, called deterministic propensity matching, has been presented by Lawrence and Riezler 2018a, Lawrence and Riezler 2018b and been tested with real human feedback in a semantic parsing scenario. The central idea is as follows: Consider logged data $D = \{(\mathbf{x}^{(h)}, \mathbf{y}^{(h)}, r(\mathbf{y}^{(h)}))\} ^H_{h=1}$ , where $\mathbf{y}^{(h)}$ is sampled from a logging system $\mu(\mathbf{y}^{(h)}|\mathbf{x}^{(h)})$ , and the reward $r(\mathbf{y}^{(h)}) \in [0,1]$ is obtained from a human user. One possible objective for off-line learning under deterministic logging is to maximize the expected reward of the logged data

$L(\theta) = \frac{1}{H}\sum_{h=1}^H r(\mathbf{y}^{(h)}) \, \bar{p}_{\theta,\theta'}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)}),$

where a multiplicative control variate (Kong 1992) is used for reweighting, evaluated one-step-late at $\theta'$ from some previous iteration (for efficient gradient calculation), where

$\bar{p}_{ \theta,\theta'}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)}) = \frac{p_{ \theta}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)})}{\sum_{b=1}^B p_{ \theta'}(\mathbf{y}^{(b)}|\mathbf{x}^{(b)})}.$

The effect of this self-normalization is to prevent the probability of low reward data from being increased during learning, because any such increase takes probability mass away from higher reward outputs. This introduces a bias in the estimator (that decreases as $B$ increases). However, it makes learning under deterministic logging feasible, thus giving the RL agent an edge in learning in an environment that has been deemed impossible in the literature. See also Carolin’s blog describing the semantic parsing scenario.

Learning reward estimators from human bandit feedback

Other issues addressed prominently in Dulac-Arnold et al. 2019 are the problems of learning from limited samples, in high dimensional action spaces, with unspecified reward functions. This is a concise description of the learning scenario in interactive machine translation: Firstly, it is unrealistic to expect anything other than bandit feedback from a human user using a commercial machine translation system. That is, a user of a machine translation system will only provide a reward signal to one deterministically produced best system output, and cannot be expected to rate a multitude of translations for the same input. Providers of commercial machine translation systems realize this and provide non-intrusive interfaces for user feedback that allow users to post-edit translations (negative signal), or to copy and/or share the translation without changes (positive signal). Furthermore, human judgements on the quality of full translations need to cover an exponential output space, while the notion of translation quality is not a well-defined function to start with: In general every input sentence has a multitude of correct translations, each of which humans may judge differently, depending on many contextual and personal factors.

Surprisingly, the question of how to give the RL agent an advantage in learning from real-world human feedback has been scarcely researched. The suggestions in Dulac-Arnold et al. 2019 may seem straightforward - warm-starting agents to decrease sample complexity or using inverse reinforcement learning to recover reward functions from demonstrations - but they require additional supervision signals that RL was supposed to alleviate. Furthermore, when it comes to the question which type of human feedback is most beneficial for training an RL agent, one finds a lot of blanket statements referring to the advantages of pairwise comparisons to produce a scale (Thurstone 1927), however, without providing any empirical evidence.

An exception is the work of Kreutzer et al. 2018. This work is one of the first to investigate which type of human feedback - pairwise judgements or cardinal feedback on a 5-point scale - can be given most reliably by human teachers, and which type of feedback allows learning reward estimators that best approximate human rewards and can be best integrated into an end-to-end RL task. Let’s look at example interfaces for 5-point feedback and pairwise judgements:

Contrary to common belief, inter-rater reliability was higher for 5-point ratings (Krippendorff’s $\alpha =0.51$ ) than for pairwise judgements ( $\alpha=0.39$ ) in the study of Kreutzer et al. 2018. They explain this by the possibility of standardizing cardinal judgements for each rater to remove individual biases, and by filtering out raters with low intra-rater reliability. The main problem for pairwise judgements was distinguishing between similarly good or bad translations; such cases could be filtered out to improve intra-rater reliability, yielding the final inter-rater reliability given above.
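
Standardizing cardinal judgements per rater can be illustrated with a z-score. Whether the study used exactly this procedure is not stated here, so take the sketch as one plausible reading; the rater names and ratings are made up.

```python
from statistics import mean, pstdev


def standardize_per_rater(ratings):
    """Z-score each rater's ratings with that rater's own mean and std. dev.

    ratings: dict mapping rater id -> list of ratings on the 5-point scale.
    """
    standardized = {}
    for rater, scores in ratings.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        if sigma == 0:
            # A rater who always gives the same score carries no ranking signal.
            standardized[rater] = [0.0 for _ in scores]
        else:
            standardized[rater] = [(s - mu) / sigma for s in scores]
    return standardized


ratings = {"strict": [1, 2, 3], "lenient": [3, 4, 5]}
z = standardize_per_rater(ratings)
```

After standardization the strict and the lenient rater yield the same scores for the same relative judgements, which is the kind of individual bias that cannot be removed from a single pairwise preference.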

Furthermore, when training reward estimators on judgements collected for 800 translations, they measured learnability by the correlation between estimated rewards and translation edit rate to human reference translations. They found that learnability was better for a regression model trained on 5-point feedback than for a Bradley-Terry model trained on pairwise rankings (as recently used for RL from human preferences by Christiano et al. 2017).
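
The difference between the two training signals can be seen in their per-example losses. The sketch below uses the textbook Bradley-Terry form and a squared-error regression loss; the exact parametrization and rescaling used in the study are not given here, so the function names and the assumption that ratings are rescaled to a common range are mine.

```python
import math


def bradley_terry_prob(reward_a, reward_b):
    """Probability that output a is preferred over output b (Bradley-Terry)."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def pairwise_loss(reward_a, reward_b, a_preferred):
    """Negative log-likelihood of one observed pairwise judgement."""
    p = bradley_terry_prob(reward_a, reward_b)
    return -math.log(p if a_preferred else 1.0 - p)


def regression_loss(estimated_reward, rating):
    """Squared error against one cardinal rating (assumed rescaled)."""
    return (estimated_reward - rating) ** 2
```

The pairwise loss only constrains the difference between two estimated rewards, while the regression loss ties each estimate to an absolute value on the rating scale.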

Finally, and most importantly, when integrating reward estimators into an end-to-end RL task, they found that one can improve a neural machine translation system by more than 1 BLEU point by a reward estimator trained on only 800 cardinal user judgements. This is not only a promising result pointing in the direction in which future research for real-world RL could happen, but it also solves all three of the above mentioned challenges of Dulac-Arnold et al. 2019 (limited samples, high dimensional action spaces, unspecified reward functions) in one approach: Reward estimators can be trained on very small datasets, and then be integrated as reward functions over high dimensional action spaces. The idea is to tackle the arguably simpler problem of learning a reward estimator from human feedback first, then provide unlimited learned feedback to generalize to unseen outputs in off-policy RL.

Further avenues: Self-regulated interactive learning

As mentioned earlier, human students have to be able to learn in situations where the most informative learning signals are the sparsest. This is because teacher feedback comes at a cost so that the most precious feedback of gold standard outputs has to be requested economically. Furthermore, students have to learn how to self-regulate their learning process and learn when to seek help and which kind of help to seek. This differs from classic RL games, where the cost of feedback is negligible (we can simulate games forever). That is not realistic in the real world, where exploration in particular can get very costly (and dangerous).

Learning to self-regulate is a new research direction that tries to equip an artificial intelligence agent with a decision-making ability that is traditionally hard for humans - balancing cost and effect of learning from different types of feedback, including full supervision by teacher demonstration or correction, weak supervision in the form of positive or negative rewards for student predictions, or a self-supervision signal generated by the student.

Kreutzer and Riezler 2019 have shown how to cast self-regulation as a learning-to-learn problem that addresses the above problem by making the agent aware of the cost-reward trade-off and letting it manage that trade-off. They find in simulation experiments on interactive neural machine translation that the self-regulator is a powerful alternative to uncertainty-based active learning (Settles and Craven 2008), and that it discovers an $\epsilon$-greedy strategy for the optimal cost-quality trade-off by mixing different feedback types including corrections, error markups, and self-supervision. Their simulation scenario of course abstracts away from certain confounding variables to be expected in real-life interactive machine learning. However, all of these are interesting directions for new research on real-life RL with human teachers.

The appeal of RL from human feedback

I tried to show that some of the challenges in real-world RL originate from the human teachers who have been considered a help in previous work (Knox and Stone, Christiano et al. 2017, Leike et al. 2018): In situations where only the feedback of a human user is available to personalize and adapt an artificial intelligence agent, the standard tricks of memorizing large amounts of labels in supervised learning, or training in unlimited rounds of self-play with cost-free and accurate rewards in RL, won’t do the job. If we want to move RL into the uncharted territories of training artificial intelligence agents from feedback of cost-aware, unfathomable human teachers, we need to make sure the agent does not depend on massive exploration, and we have to learn great models of human feedback. It will be interesting to see how and what artificial intelligence agents learn in the same information-deprived situations that human students have to deal with, and hopefully, it will lead to artificial intelligence agents that can support humans by smoothly adapting to their needs.

Acknowledgment: Thanks to Julia Kreutzer and Carolin Lawrence for our joint work and their valuable feedback on this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of his affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this bib.

@misc{riezler:hrl:19,
    author = {Riezler, Stefan},
    title = {The Real Challenges of Real-World Reinforcement Learning: The Human Factor},
    journal = {StatNLP HD Blog},
    type = {Blog},
    number = {July},
    year = {2019},
    howpublished = {\url{https://www.cl.uni-heidelberg.de/statnlpgroup/blog/hrl/}}
}

Counterfactual Learning of Semantic Parsers When Even Gold Answers Are Unattainable

2019-01-14T00:00:00+00:00

In semantic parsing, natural language questions are mapped to semantic parses. A semantic parse can be executed against a database to obtain an answer. This answer can then be presented to a user.

Semantic parsers for question-answering can be employed in virtual personal assistants, which have been increasingly on the rise in recent years. Because such assistants are expected to help with an increasing number of tasks, we need to explore the best possible options to efficiently and effectively set up a parser for a new domain, to adapt it to specific user needs and to generally ensure that it improves.

However, obtaining labelled data can be challenging. In this post, we first consider the different possible supervision signals that can be used to train a semantic parser. This influences which objectives can be used for training, which we explore in the second part.

Supervision Signal

Question-Parse Pairs

To train a semantic parser, direct supervision means the collection of question-parse pairs. This can be difficult if the parse language is only understood by expert users. One option is to ensure that the parse language is as widely known as possible, e.g. by choosing SQL (Iyer et al., 2017). However, even in the case of SQL, experts are required for the annotation, which can quickly become very expensive.

Question-Answer Pairs

An alternative option is to employ a weaker supervision signal. Collecting question-answer pairs is easier for many domains (Berant et al., 2013; Iyyer et al., 2014; Yang et al., 2015; inter alia) and can typically be done by non-experts.

However, the weaker supervision signal from question-answer pairs presents a harder learning task. While the gold answer is known, it remains unclear which parse will lead to the gold answer. During training, the parser has to explore the output space to find a parse that executes to the correct gold answer. This search can be difficult as the output space is large. Furthermore, instead of finding a parse that represents the correct meaning of the question, one might find a spurious parse instead. Such a parse happens to execute to the gold answer, but conveys the wrong meaning. This hampers generalisation.

For example, assume we have the question “Are there any bars?” and instead of mapping “bar” to the logical form for “bar”, the parser maps it to the logical form for “restaurant”. If the answer for both “Are there any bars?” and “Are there any restaurants?” is “Yes”, then the wrong logical form “restaurant” for the question “Are there any bars?” will lead to the correct answer. The parser has now wrongly learnt to map “bar” to the logical form “restaurant”, and for other questions, such as “Where is the closest bar?”, it will now return the closest restaurant instead.
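To make the problem concrete, here is a minimal Python sketch of the bar/restaurant example. The toy database and the two helper functions `exists` and `closest` are hypothetical stand-ins for a real database and parse executor; they are only meant to show how a spurious parse passes the answer check for one question and fails for another.

```python
# Toy database (hypothetical): points of interest with a type and a distance in metres.
DATABASE = [
    {"name": "Blue Note", "type": "bar", "distance": 900},
    {"name": "Trattoria Roma", "type": "restaurant", "distance": 200},
]


def exists(poi_type):
    """Execute the parse for 'Are there any <poi_type>s?'."""
    return "Yes" if any(p["type"] == poi_type for p in DATABASE) else "No"


def closest(poi_type):
    """Execute the parse for 'Where is the closest <poi_type>?'."""
    candidates = [p for p in DATABASE if p["type"] == poi_type]
    return min(candidates, key=lambda p: p["distance"])["name"]


# Question: "Are there any bars?"
gold_answer = exists("bar")             # parse with the correct meaning
spurious_answer = exists("restaurant")  # parse with the wrong meaning

# Both parses execute to the same answer, so supervision from the
# answer alone cannot tell them apart.
same_answer = gold_answer == spurious_answer

# Question: "Where is the closest bar?"
# Here the wrong mapping from "bar" to "restaurant" changes the answer.
correct_closest = closest("bar")
spurious_closest = closest("restaurant")
```

In this toy setting, `same_answer` is true although only one of the two parses captures the meaning of the question, which is exactly why a reward based on the answer alone can reinforce the wrong mapping.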

Comparison: Question-Parse vs. Question-Answer Pairs

Yih et al., 2016 investigated the cost and benefit of obtaining question-parse pairs compared to collecting question-answer pairs. For this, they use the WebQuestions corpus (Berant et al., 2013), which is based on the Freebase database. The corpus was originally collected with the help of non-expert crowd-source workers in the form of question-answer pairs. Yih et al., 2016 annotate each question in the corpus with the corresponding gold parse. To ease the annotation, they designed a simple user interface and hired experts familiar with Freebase.

Next, they compared a system trained on question-parse pairs to a system trained on question-answer pairs. In their experiments, they were able to show three, in part surprising, results:

1. The model trained on question-parse pairs outperforms the model trained on question-answer pairs by over 5 percentage points in answer accuracy.
2. Answer annotation by crowd-source workers is often incorrect: in their evaluation, it was incorrect 34% of the time.
3. With an easy-to-use interface, experts can write the correct semantic parse faster than they can retrieve the correct answer.

Observation 1. does not come as a surprise as question-parse pairs offer a stronger learning signal. But both 2. and 3. are surprising. However, as noted previously, hiring experts to annotate gold parses can be expensive.

A further problem arises for domains where it is not easy to collect gold answers, for example when answers are open-ended lists, fuzzily defined or very large.

This is for example the case in the domain of geographical question-answering using the OpenStreetMap database. Here, the underlying parse language is only known to a few expert users, which makes the collection of gold parses particularly difficult. Furthermore, it is often impossible to collect gold answers because in many cases the gold answer set is too large or too fuzzily defined (e.g. when searching for objects “near” another one) to be obtained in a reasonable amount of time or without error.

Question-Feedback Pairs

In cases where the collection of both gold parses and gold answers is infeasible, we need to obtain a learning signal from other sources. One option is to obtain feedback from users while they are interacting with the system (Lawrence & Riezler, 2018).

For this, a baseline semantic parser is trained on a small amount of question-parse pairs. This parser can be used to parse further questions for which neither gold parses nor gold answers exist. The parse suggested by the baseline can then be automatically transformed into a set of human-understandable statements. Human users can easily judge each of these statements as correct or incorrect. This feedback can be used to further improve the parser.

For example, below is a question and the statements automatically generated from the corresponding parse.

With the filled in form, we know which parts of the parse are wrong.

This allows us to go further than just promoting correct parses. For each statement, we are able to map it back to the tokens in the parse that produced it. This allows us to learn from partially correct parses, where we only promote the tokens associated with correct statements.
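The mapping from statement-level feedback to token-level rewards can be sketched as follows. The parse tokens, the statements, the `token_ids` alignment and the user feedback are all made up for illustration, and giving a reward of 0 to tokens without any correct statement is an assumption of this sketch, not a description of the actual system.

```python
# Hypothetical parse tokens produced by the baseline parser.
parse_tokens = ["find", "type=restaurant", "near", "city=Heidelberg", "maxdist=5km"]

# Each statement shown to the user remembers which token positions produced it.
statements = [
    {"text": "You are looking for restaurants.", "token_ids": [1]},
    {"text": "The search is centred on Heidelberg.", "token_ids": [2, 3]},
    {"text": "Results are at most 5 km away.", "token_ids": [4]},
]

# Hypothetical user feedback: one correct/incorrect judgement per statement.
feedback = [False, True, True]


def token_rewards(num_tokens, statements, feedback):
    """Give reward 1.0 to tokens behind statements judged correct, 0.0 otherwise."""
    rewards = [0.0] * num_tokens
    for statement, is_correct in zip(statements, feedback):
        if is_correct:
            for token_id in statement["token_ids"]:
                rewards[token_id] = 1.0
    return rewards


rewards = token_rewards(len(parse_tokens), statements, feedback)
```

With rewards of this kind, only the tokens behind statements that the user confirmed would be promoted during training, even though the parse as a whole is wrong.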

Objectives

The collected data decides which objectives can be applied during training. Below we give an overview of various objectives, which data they require and what their advantages and disadvantages are.

First off, here is some general notation:

$\pi_w$ : neural network with parameters $w$
$x = x_1, x_2, \dots x_{\mid x\mid }$ : input question
$y = y_1, y_2, \dots y_{\mid y\mid }$ : output parse
$\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{\mid \bar{y}\mid }$ : gold parse
$\bar{a}$ : gold answer

We define an objective in terms of a loss function $\mathcal{L}$ . For training, we differentiate it with respect to the model’s parameters $w$ to make (stochastic) gradient descent updates, $w = w - \eta \nabla_w \mathcal{L}$ , where $\eta$ is a suitable learning rate.

Question-Parse Pairs: Maximum Likelihood Estimation (MLE)

Neural networks are typically trained using MLE, where the probability of a gold parse $\bar{y}$ is raised given a question $x$ (e.g. Dong & Lapata, 2016 or Jia & Liang, 2016). The objective is defined as follows:

$\mathcal{L}_{MLE} = - \sum_{j=1}^{\mid \bar{y}\mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x),$

where $\bar{y}_{<j} = \bar{y}_{1}, \bar{y}_{2}, \dots \bar{y}_{j-1}.$
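As a small illustration of this loss, the following Python sketch computes $\mathcal{L}_{MLE}$ from the probabilities that a model assigns to each gold token. The probability values are invented; in a real parser they would come from the network's softmax output at each position.

```python
import math


def mle_loss(gold_token_probs):
    """Negative log-likelihood of a gold parse.

    gold_token_probs[j] stands for pi_w(gold token j | previous gold tokens, x).
    """
    return -sum(math.log(p) for p in gold_token_probs)


# Hypothetical per-token probabilities for the same gold parse under two models.
confident = mle_loss([0.9, 0.8, 0.95])  # model that already prefers the gold parse
uncertain = mle_loss([0.5, 0.2, 0.3])   # model that spreads its probability mass
```

The loss is lower for the model that puts more probability on every gold token, and it reaches zero only if every gold token receives probability one.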

However, this approach is only possible if gold targets $\bar{y}$ are available. As mentioned in the first section, obtaining these might be too expensive in practice and weaker supervision signals are the practical alternative.

There is a further reason for a different objective, even when question-parse pairs are available:

There might be other parses, not just the annotated gold parse, that lead to the correct answer. But these can never be discovered if the MLE objective is used. Discovering further valid parses could stabilise learning and help generalisation. Further, this allows the parser to find suitable parses in its own output space.

Next, we turn to objectives which assume the existence of gold answers, obtained either by executing gold parses or because gold answers were annotated directly. For these objectives, a parser produces model outputs which are executed to obtain corresponding answers. These answers can be compared to the available gold answer, and a reward can be assigned to each model output.

Question-Answer Pairs: REINFORCE and Minimum Risk Training (MRT)

Recently, there has been a surge of interest in applying reinforcement learning approaches, in particular the REINFORCE algorithm (Williams 1992), to (weakly) supervised NLP tasks. The inherent issues that arise in this context are also explored with regard to neural machine translation in another blog post.

We will first introduce the REINFORCE algorithm, then discuss potential issues.

In REINFORCE, given an input question $x$ , one output $y$ is sampled from the current model distribution (see Section 13.3 of Sutton & Barto, 2018). Executing this sampled parse to obtain an answer $a$ , the comparison with the gold answer $\bar{a}$ provides us with a reward $\delta$ . On the basis of this single reward, the model’s parameters are updated, i.e. we can define the following objective:

$\mathcal{L}_{REINFORCE} = - \delta \pi_w(y\mid x).$

However, with just one sample, this objective can suffer from high variance. This can be combated by introducing control variates, which lower variance. The most popular choice is using a baseline, where we keep track of the average reward, which is subtracted from $\delta$ .
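The effect of a baseline can be sketched on a toy problem. The code below is only an illustration under several assumptions: the "parser" is a softmax over three candidate outputs with one logit each, the rewards are fixed made-up numbers, the baseline is a running average of past rewards, and the update uses the standard score function form of the policy gradient (reward times $\nabla_w \log \pi_w(y \mid x)$, as in Sutton & Barto, 2018).

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_with_baseline(rewards, steps=2000, lr=0.1, seed=0):
    """Train a toy softmax policy with one sample per step and an average-reward baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline, seen = 0.0, 0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a single output from the current model distribution.
        y = rng.choices(range(len(rewards)), weights=probs)[0]
        delta = rewards[y]
        advantage = delta - baseline  # reward with the baseline subtracted
        for k in range(len(logits)):
            # d log pi(y) / d logit_k for a softmax policy.
            grad_log_pi = (1.0 if k == y else 0.0) - probs[k]
            # Gradient descent step on the loss -advantage * log pi(y).
            logits[k] -= lr * (-advantage * grad_log_pi)
        # Update the running average of observed rewards.
        seen += 1
        baseline += (delta - baseline) / seen
    return softmax(logits)


# Hypothetical rewards for three candidate parses of one question.
final_probs = reinforce_with_baseline(rewards=[0.0, 1.0, 0.2])
```

Subtracting the baseline means that outputs with below-average reward are actively pushed down instead of merely being pushed up less, which is what reduces the variance of the updates.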

But why only sample one output?

We have the luxury of having gold answers available.

This allows us to sample several outputs and obtain rewards for all of them. With this, an average can be computed and on the basis of this average updates to $w$ are performed. First, this lowers the variance. Second, it allows us to try out several model outputs, which helps us to explore the output space and in turn increases our chance of finding a parse that leads to the correct answer.

Building an average based on several outputs obtained for one input is exactly the characteristic idea of Minimum Risk Training (MRT).

MRT was introduced in the context of log-linear models for dependency parsing and machine translation (Smith & Eisner, 2006). It has also been tested for neural models in the context of machine translation (Shen et al., 2016).

Sampling $S$ outputs per input, we can define the following MRT objective:

$\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s\mid x),$

where $\delta_s$ is the reward obtained for the sampled output $y_s$ .

This objective is for example employed in Liang et al., 2017. Although they use the term REINFORCE (“We apply REINFORCE”), their later objective is based upon $S$ outputs (“Thus, in contrast with common practice of approximating the gradient by sampling from the model, we use the top- $k$ action sequences”), which is reminiscent of MRT. Similarly, Guu et al., 2017 also calculate an average over several output samples for one input (see their Equation 9). Mou et al., 2017 also take advantage of the gold answers to sample and evaluate several parses for one input (“We adjust the reward by subtracting the mean reward, averaged over sampled actions for a certain data point.”).

MRT is superior because by sampling several outputs per input, it exhibits lower variance than REINFORCE. But it can only be applied if question-answer pairs are available. Additionally, it is more expensive to compute.
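The variance argument can be checked empirically on a toy example. In the sketch below, the softmax policy over three outputs, the logits and the rewards are all invented, and the gradient is estimated in its score function form as a plain average over the sampled outputs; this is a simplification and not the exact procedure of any of the cited papers.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_estimate(logits, rewards, num_samples, rng):
    """Average the score function gradient of the risk over num_samples sampled outputs."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == y else 0.0) - probs[k]
            grad[k] += -rewards[y] * grad_log_pi / num_samples
    return grad


def variance_of_first_component(num_samples, trials=2000, seed=0):
    """Empirical variance of the first gradient component over repeated estimates."""
    rng = random.Random(seed)
    logits = [0.0, 0.5, -0.5]   # hypothetical model state
    rewards = [1.0, 0.0, 0.3]   # hypothetical rewards per output
    estimates = [
        gradient_estimate(logits, rewards, num_samples, rng)[0]
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


var_single = variance_of_first_component(num_samples=1)   # REINFORCE-style
var_mrt = variance_of_first_component(num_samples=10)     # MRT-style average
```

Averaging over ten samples should shrink the variance of the estimate roughly tenfold, at the price of executing and scoring ten outputs per input instead of one.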

For our final scenario from the previous section, where neither gold answers nor gold parses are available and we only have feedback collected for one model output, we are limited to only one sample and MRT cannot be applied.

Let’s see which objectives we can apply in such scenarios.

Question-Feedback Pairs: REINFORCE and Counterfactual Learning

A setup where only one output and its corresponding feedback are available is also called a bandit learning scenario. The name is inspired by choosing one among several slot machines (colloquially referred to as “one-armed bandits”), where we only observe the reward for the chosen machine (i.e. output) and it remains unknown what reward the other machines (or outputs) would have obtained.

This is a crucial contrast to learning from question-answer pairs. We illustrate this graphically in the figure below. The left side shows the scenario where question-answer pairs are available, whereas the right assumes question-feedback pairs where no gold answers are available.

REINFORCE is still applicable in bandit learning scenarios. But if we collect feedback as users are using the system, it can be dangerous to update the parser’s parameters online.

The parser’s performance could deteriorate without notice, which can lead to user dissatisfaction and monetary loss. It also makes it impossible to explore different hyperparameter settings.

Instead, it is safer to first collect the feedback in a log of triples $\mathcal{D}_{log}=\{(x_m,y_m,\delta_m)\}_{m=1}^M$ . Once enough feedback has been collected, the log can be used to further improve the parser offline. The resulting model can then be validated against additional test sets before it is deployed.

However, once we start learning, the outputs recorded in the log might no longer be the outputs the updated parser would choose; i.e. the log we collected is biased towards the parser that was deployed at the time. Learning from such a log leads to a counterfactual, or off-policy, learning setup.

The bias in the log can be corrected using importance sampling, where we divide the probability that the new model $\pi_w$ assigns to the logged output by the probability that the deployed parser $\mu$ assigned to that output. This leads to the following Inverse Propensity Score (IPS) objective:

$\mathcal{L}_{IPS} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \frac{\pi_w(y_m\mid x_m)}{\mu(y_m\mid x_m)}.$
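A minimal sketch of how the IPS objective could be evaluated on a log is shown below. It assumes that the logging probability of the deployed parser was stored next to each logged triple (a fourth entry that is not part of the log as defined above), and the questions, parses and probabilities are invented for illustration.

```python
def ips_loss(log, new_policy_prob):
    """Inverse propensity score objective.

    log: list of (x, y, delta, mu_prob) entries, where mu_prob is the probability
         the deployed parser assigned to the logged output (assumed to be stored).
    new_policy_prob: function (x, y) -> pi_w(y | x) under the model being trained.
    """
    total = 0.0
    for x, y, delta, mu_prob in log:
        total += delta * new_policy_prob(x, y) / mu_prob
    return -total / len(log)


# Hypothetical log with three entries.
log = [
    ("q1", "parse_a", 1.0, 0.5),
    ("q2", "parse_b", 0.0, 0.8),
    ("q3", "parse_c", 1.0, 0.25),
]

# Hypothetical probabilities of the new model for the logged outputs.
new_probs = {("q1", "parse_a"): 0.6, ("q2", "parse_b"): 0.7, ("q3", "parse_c"): 0.5}

loss = ips_loss(log, lambda x, y: new_probs[(x, y)])
```

Outputs that the deployed parser chose with low probability (such as `parse_c` here) receive a larger weight, which compensates for the fact that they are under-represented in the log.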

However, because we ideally want to present only correct parses and answers to our users, we want to always show the most likely output under the currently deployed model. This results in a deterministic log because the probability of choosing the most likely output is always one. Consequently, we can no longer correct the data bias.

This leads to the Deterministic Propensity Matching (DPM) objective:

$\mathcal{L}_{DPM} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \pi_w(y_m\mid x_m).$
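The DPM objective can be evaluated in the same way as a sketch; since the logging probability is always one, the log only needs to contain the triples themselves. The entries and model probabilities below are made up for illustration.

```python
def dpm_loss(log, new_policy_prob):
    """Deterministic propensity matching objective.

    log: list of (x, y, delta) triples collected under deterministic logging.
    new_policy_prob: function (x, y) -> pi_w(y | x) under the model being trained.
    """
    return -sum(delta * new_policy_prob(x, y) for x, y, delta in log) / len(log)


# Hypothetical log: the deployed parser always showed its most likely parse.
log = [
    ("q1", "parse_a", 1.0),
    ("q2", "parse_b", 0.0),
    ("q3", "parse_c", 0.5),
]

# Hypothetical probabilities of the new model for the logged outputs.
new_probs = {("q1", "parse_a"): 0.6, ("q2", "parse_b"): 0.7, ("q3", "parse_c"): 0.4}

loss = dpm_loss(log, lambda x, y: new_probs[(x, y)])
```

Minimising this loss raises the probability of logged outputs in proportion to their reward, but without any correction for which outputs the deployed parser happened to show.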

Just like REINFORCE, both IPS and DPM suffer from high variance because only one output received a reward for each input. It is thus advisable to employ control variates, e.g. one-step-late reweighting (Lawrence & Riezler, 2018).

With these objectives it is possible to learn from question-feedback pairs where feedback was collected from users for one system output. This approach is useful for scenarios where neither the collection of gold parses nor the collection of gold answers is feasible.

Furthermore, this approach can also be applied to other tasks, such as machine translation (Lawrence et al., 2017; Kreutzer et al., 2018a; Kreutzer et al., 2018b).

Summary

Semantic parsers are important modules in virtual personal assistants and with an increasing number of domains in which these assistants are used, we need to find efficient and effective methods to train parsers on new domains.
In general, the stronger the learning signal, the better the result. For each new domain, we should estimate the time and cost of the different approaches, while keeping in mind that: question-parse pairs > question-answer pairs > question-feedback pairs.
If it is too expensive to obtain gold parses or gold answers, then using counterfactual learning from question-feedback pairs is a viable alternative.

Acknowledgment: Thanks to Julia Kreutzer and Stefan Riezler for their valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. In particular, feel free to let us know if you think there is an important paper that we should add to this overview! We appreciate your feedback! If you want to cite this blogpost, use this bibfile.

Taming Wild Reward Functions: The Score Function Gradient Estimator Trick

2018-11-12T00:00:00+00:00

MLE is often not enough to train sequence-to-sequence neural networks in NLP. Instead we employ an external metric, which is a reward function that can help us judge model outputs. The parameters of the network are then updated on the basis of the model outputs and corresponding rewards.

For this update, it is necessary to obtain a derivative.

But how can we do this, if the external function is unknown or cannot be derived?

Enter: The score function gradient estimator trick.

Why MLE is not Enough

Traditionally, neural networks are trained using Maximum Likelihood Estimation (MLE): given an input sequence $x = x_1, x_2, \dots x_{ \mid x \mid }$ and a corresponding gold target sequence $\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{ \mid \bar{y} \mid }$ , we want to increase the probability that the current model $\pi$ with parameters $w$ assigns for the pair $(x,\bar{y})$ . This gives the following loss function:

$\mathcal{L}_{MLE} = - \sum_{j=1}^{ \mid \bar{y} \mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x),$

where $\bar{y}_{<j} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{j-1}.$

The parameters $w$ of $\pi$ are then updated using stochastic gradient descent,

$w = w - \eta \nabla_w \mathcal{L}_{MLE}.$

But there are various issues with using MLE that have led researchers to explore alternative objectives. Let’s look at them next.

1. Gold targets $\bar{y}$ are not Available

This is most prominently the case in many domains of semantic parsing for question-answering, where questions $x$ are mapped to a semantic parse $y$ , which can be executed to obtain an answer $a$ . For many domains, it is easier to collect question-answer pairs, rather than question-parse pairs (e.g. see Berant et al. 2013). But with no gold parses available, MLE cannot be applied.

What can we do instead?

The current model produces a set of likely parses (e.g. by sampling from the model distribution or by employing beam search). Each parse is then executed to obtain an answer. Next, we compare the answer to the gold answer to get a reward $\delta$ . Generally, we have $\delta=0$ if there is no overlap between answer and gold answer and $\delta=1$ if they match exactly. With this, we can update the model’s parameters.
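One way such a reward could be computed is sketched below. The text only fixes the two end points ($\delta=0$ for no overlap, $\delta=1$ for an exact match); using the F1 score between the two answer sets for everything in between is an assumption of this sketch, and the function name `answer_reward` is hypothetical.

```python
def answer_reward(predicted, gold):
    """Reward in [0, 1] from the overlap between predicted and gold answer sets.

    Uses the F1 score of the two sets (an assumption of this sketch):
    0.0 without any overlap, 1.0 for an exact match.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        # An empty answer is only rewarded if the gold answer is empty as well.
        return 1.0 if predicted == gold else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


exact = answer_reward(["a", "b"], ["a", "b"])    # answers match exactly
disjoint = answer_reward(["c"], ["a", "b"])      # no overlap
partial = answer_reward(["a"], ["a", "b"])       # partial overlap
```

A graded reward of this kind gives the parser a learning signal for partially correct answers, which a strict 0/1 comparison would discard.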

2. Exposure Bias: Ranzato et al. 2016

During traditional MLE training the model is fed the perfect tokens from the available gold target $\bar{y}$ , but at test time the output sequence is produced on the basis of the model distribution. This causes a distribution mismatch and inferior performance.

How can we reduce this mismatch?

Instead, we can feed model output sequences already at training time. Typically, once an entire output sequence has been produced, this sequence is judged by an external metric and the resulting reward $\delta$ can be used as feedback to update the model’s parameters.

3. Loss-Evaluation Mismatch: Wiseman & Rush 2016

MLE is agnostic to the final evaluation metric. Ideally we would like to have the final evaluation metric in the objective used at training time, so that the parameters of the model are specifically tuned to perform well on the intended task.

How can we do that?

Similar to problem (2.), we can feed model output sequences at training time. In this case the external metric is the final evaluation metric. For example, in the case of machine translation, typically a per-sentence approximation of the BLEU score is used.

Maximise the Expected Reward Obtained for Model Outputs

To solve all three problems, we can instead maximise the expected reward $\delta$ or, equivalently, minimise the expected risk $-\delta$ . This can be formulated as the following expectation:

$\mathcal{L}_\delta = \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta],$

where $p(x)$ is the probability distribution over inputs $x$ and $\pi_w(y \mid x)$ is the probability distribution over outputs $y$ given $x$ .

In practice, this expectation has to be approximated. For example, using Monte-Carlo sampling leads to the REINFORCE algorithm (Williams 1992): we sample one output $y$ from the model distribution $\pi_w(y \mid x)$ (see also Chapter 13 of Sutton & Barto 2018). Approximating the expectation over $y$ , the actual training objective becomes:

$\mathcal{L}_{REINFORCE} = - \delta \pi_w(y \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].$

The goal of this objective is to increase the probability of an output proportionally to its reward. The gradient of the REINFORCE objective is an unbiased estimate of the gradient of the $\mathcal{L}_\delta$ objective.

Alternatively, we can use Minimum Risk Training (MRT) (Smith & Eisner ‘06, Shen et al. 2016). Here, several outputs are sampled from the model distribution. This stabilises learning, but requires that more outputs are evaluated to get corresponding rewards. Assuming $S$ sampled outputs, the objective then takes the following form:

$\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].$

Due to sampling, both approaches can suffer from high variance, which can be combatted using control variates (see for example Chapter 9 of Ross 2013).

The Problem: The Reward Function cannot be Derived

To minimize $\mathcal{L}_{\delta}$ with stochastic gradient descent, it is necessary to calculate $\nabla_w \mathcal{L}_{\delta}$ , also called the policy gradient in Reinforcement Learning (RL) terms.

But in practice, the rewards $\delta$ typically either come from an unknown function (e.g. if rewards are collected from human users) or the underlying function cannot be differentiated (e.g. in the case of BLEU).

As such, it is not immediately clear how to differentiate $\mathcal{L}_{\delta}$ , i.e. how to calculate $\nabla_w\mathcal{L}_{\delta}.$

The Solution: Score Function Gradient Estimator

To be able to calculate $\nabla_w\mathcal{L}_{\delta}$ , we use two tricks:

1. The $\log$ Derivative Trick

The derivative of the logarithm is:

$\nabla_w \log f = \frac{\nabla_w f}{f}.$

2. The Identity Trick

$f = \frac{g}{g} f$

Now we can formulate what is known as the score function gradient estimator (Fu ‘06):

$\begin{align} \nabla_w \mathcal{L}_\delta &= \nabla_w \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [- \delta] & (1) \\ &= \nabla_w \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (2) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w\pi_w(y \mid x)\textrm{d}y & (3) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \frac{\nabla_w \pi_w(y \mid x)}{\pi_w(y \mid x)} \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (4) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w \log \pi_w(y \mid x) \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (5) \\ &= \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta \nabla_w \log \pi_w(y \mid x)]. & (6) \end{align}$

Let’s investigate for each line what happened:

(2): The expectation is expanded into two integrals. $\mathbb{E}_{p(x)}$ becomes $\int_{x} \dots p(x)\textrm{d}x$ and $\mathbb{E}_{\pi_w(y \mid x)}$ turns into $\int_{y} \dots \pi_w(y \mid x)\textrm{d}y$ .
(3): Integral and differentiation can be switched, so we move $\nabla_w$ in front of $\pi_w(y \mid x)$ because $\pi_w(y \mid x)$ is the only term dependent on $w$ .
(4): We use the identity trick with $g = \pi_w(y \mid x)$ .
(5): We use the $\log$ derivative trick.
(6): We still have $\pi_w(y \mid x)\textrm{d}y$ available. With this, we can transform the expression back into an expectation. But in contrast to before, we now have $\nabla_w \log \pi_w(y \mid x)$ and this derivative is simply scaled by $\delta$ .

$\rightarrow$ We no longer need to know what the function that produces $\delta$ looks like or derive it.
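The result in line (6) can be checked numerically on a toy problem. The sketch below assumes a softmax policy over three outputs with one logit per output and a made-up table of rewards; the exact gradient is obtained by enumerating all outputs, while the estimator only looks up the reward of each sampled output and never differentiates the reward function.

```python
import math
import random

REWARDS = [0.1, 0.9, 0.4]  # hypothetical black-box rewards delta(y)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits):
    """Gradient of E_pi[-delta] with respect to the logits, by enumeration."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, REWARDS))
    # For a softmax policy: d/d logit_k of sum_y pi(y) * (-delta(y))
    # equals -pi_k * (delta_k - E[delta]).
    return [-p * (r - expected_reward) for p, r in zip(probs, REWARDS)]


def score_function_estimate(logits, num_samples, seed=0):
    """Monte Carlo estimate of E_pi[-delta * grad log pi(y)]."""
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        y = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == y else 0.0) - probs[k]
            grad[k] += -REWARDS[y] * grad_log_pi / num_samples
    return grad


logits = [0.2, -0.1, 0.5]
exact = exact_gradient(logits)
estimate = score_function_estimate(logits, num_samples=50000)
max_error = max(abs(e - s) for e, s in zip(exact, estimate))
```

With enough samples the two gradients agree up to sampling noise, even though the estimator treats the rewards purely as numbers attached to sampled outputs.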

For an alternative view on the subject, also see this great blog post.

When can it be applied?

The score function gradient estimator can be applied independently of the underlying model, as long as the model is differentiable with respect to its parameters.

E.g. if $\pi_w(y \mid x)$ is a log-linear model with feature vectors $\phi(x,y)$ ,

$\pi_w(y \mid x) = \frac{e^{ w \phi(x,y)}}{\sum_{y'\in \mathbf{Y}(x)} e^{ w \phi(x, y')}},$

then the derivative would be

$\nabla_w \log \pi_w(y \mid x) = \phi(x,y) - \sum_{y'\in \mathbf{Y}(x)} \phi(x, y')\pi_w(y' \mid x).$
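The log-linear gradient can be verified with a finite-difference check. The feature vectors and weights below are invented, and the helper names `log_prob` and `grad_log_prob` are hypothetical; the sketch only illustrates that the observed features minus the expected features match a numerical derivative of $\log \pi_w(y \mid x)$.

```python
import math


def log_prob(w, features, y):
    """log pi_w(y | x) for a log-linear model; features[k] plays the role of phi(x, k)."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in features]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[y] - log_z


def grad_log_prob(w, features, y):
    """Analytic gradient: phi(x, y) minus the expected feature vector under pi_w."""
    probs = [math.exp(log_prob(w, features, k)) for k in range(len(features))]
    expected = [
        sum(p * phi[d] for p, phi in zip(probs, features))
        for d in range(len(w))
    ]
    return [features[y][d] - expected[d] for d in range(len(w))]


# Hypothetical feature vectors for three candidate outputs and a weight vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.3, -0.2]

analytic = grad_log_prob(w, features, y=0)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = []
for d in range(len(w)):
    w_plus = list(w)
    w_plus[d] += eps
    w_minus = list(w)
    w_minus[d] -= eps
    diff = log_prob(w_plus, features, 0) - log_prob(w_minus, features, 0)
    numeric.append(diff / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The same kind of check is a cheap sanity test for any hand-derived gradient before plugging it into the score function estimator.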

In the case of neural networks, backpropagation is applied to compute $\nabla_w \pi_w(y \mid x)$ (see for example Chapter 3 of Cho 2015).

Lessons Learnt

MLE can sometimes not be applied or cause inferior performance.
Instead, we leverage rewards from an external metric that evaluates the quality of our model outputs.
The metric might be unknown or cannot be derived: (stochastic) gradient descent cannot be applied directly.
The score function gradient estimator helps us side-step this problem.

Acknowledgment: Thanks to Julia Kreutzer for her valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this bibfile.
