Jekyll2019-08-15T08:37:23+00:00http://www.cl.uni-heidelberg.de/statnlpgroup/feed.xmlStatNLP HeidelbergStatistical NLP Group at Heidelberg University, GermanyStatNLP Groupstatnlpgroup@cl.uni-heidelberg.deResponse-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview2019-08-15T00:00:00+00:002019-08-15T00:00:00+00:00http://www.cl.uni-heidelberg.de/statnlpgroup/blog/learn_from_feedback<!--# Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP-->
<p>“We all need people who will give us feedback. That’s how we improve.” - Bill Gates, TED Talks Education, May 2013</p>
<h2 id="motivation">Motivation</h2>
<p>We all know that <strong>supervised data is expensive</strong> to obtain. So let’s ask the following question: What if we learn from feedback given to model outputs instead?</p>
<p>Besides reducing the need for supervised data, learning from feedback has several other advantages:</p>
<ul>
<li>Even if supervised data is given, we want to also discover <strong>alternative good outputs</strong>.</li>
<li>With feedback given to model outputs, we can <strong>improve over time</strong>.</li>
<li>It is possible to <strong>personalise a system</strong> to a specific use case or user.</li>
</ul>
<p>For these reasons, I explored how to learn from feedback for sequence-to-sequence tasks in NLP in my PhD thesis.</p>
<p>The scenario I assume in my thesis can be summarized with the following picture:</p>
<p><img src="/statnlpgroup/images/blog/2019-08_problem_overview.png" alt="problem_overview" /></p>
<p>A pre-trained model receives an input for which it produces one or several outputs. An output is grounded in a given external world which assigns some feedback to it. The feedback is then used to update the pre-trained model.</p>
<h2 id="overview">Overview</h2>
<p>While exploring how to learn from feedback, there are three different aspects we consider in the thesis.
First, we have a final application in mind: we want to build a natural language interface to the geographical database <a href="http://www.openstreetmap.org">OpenStreetMap (OSM)</a>. Second, we consider two different approaches to learn from feedback, response-based on-policy learning and counterfactual off-policy learning. Third, both approaches are applied to two different tasks, semantic parsing for question answering and machine translation.</p>
<p><img src="/statnlpgroup/images/blog/2019-08_roadmap.jpg" alt="roadmap" /></p>
<p>Overall, the thesis has three parts, which we now look at in turn.</p>
<p>In Part 1 we set up the application of building a natural language interface to OSM. Parts 2 and 3 each look at one approach to learning from feedback. In both cases, the approach is applied to both tasks, semantic parsing for question answering and machine translation. Finally, we conclude by drawing a direct comparison between the two approaches.</p>
<p>Read on for the details or <a href="#conclusion">jump to the conclusion</a>.</p>
<h2 id="part-1-a-natural-language-interface-to-osm">Part 1: A Natural Language Interface to <a href="http://www.openstreetmap.org">OSM</a></h2>
<h3 id="question-answering-task">Question-Answering Task</h3>
<p><a href="http://www.openstreetmap.org">OpenStreetMap (OSM)</a> is a geographical database of points of interest (POIs) in the world, populated by volunteers. Currently, it can only be queried with straightforward string matching methods. But to find POIs with more complex relationships, such as “where is the hotel closest to the main station?”, it is necessary to issue a complicated database query. Because everyday users do not know how to issue such complex queries, we build a natural language interface to OSM. Here, users can ask natural language questions that are then automatically mapped to database queries. The execution of a query against the OSM database yields the corresponding answer. To achieve the automatic mapping, we built a semantic parser that learns to transform a natural language question to a database query, in this context also called a (semantic) parse.</p>
<p>We first collected a manually annotated corpus, <a href="https://www.cl.uni-heidelberg.de/statnlpgroup/nlmaps/">NLmaps</a>, of 2,380 question-parse pairs. This corpus was later automatically extended and <strong><a href="https://www.cl.uni-heidelberg.de/statnlpgroup/nlmaps/">NLmaps v2</a> contains 28,609 question-parse pairs</strong>.</p>
<h3 id="semantic-parsers">Semantic Parsers</h3>
<p>Either corpus allows us to train a semantic parser. For NLmaps v2, we found the best parser to be an encoder-decoder neural network with attention (based on <a href="https://github.com/EdinburghNLP/nematus">Nematus</a>). Additionally, named entities are handled separately in three steps. First, prior to the semantic parsing step, another neural network identifies named entities. Second, these named entities are replaced with placeholders for the semantic parsing step. Finally, the original named entities are inserted back into the placeholders of the parse. This led to a parser with an answer-level F1 score of about 90%.</p>
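<p>The placeholder handling can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the placeholder format <code>@NE0@</code>, the function names and the example parse string are all assumptions, and the named entities are passed in by hand instead of being predicted by a neural tagger.</p>

```python
def mask_entities(question, entities):
    """Replace each named entity with an indexed placeholder.

    The '@NE<i>@' placeholder format is a hypothetical choice for this sketch.
    Returns the masked question and a placeholder -> entity mapping.
    """
    mapping = {}
    masked = question
    for i, entity in enumerate(entities):
        placeholder = f"@NE{i}@"
        mapping[placeholder] = entity
        # Replace only the first occurrence so repeated strings stay aligned.
        masked = masked.replace(entity, placeholder, 1)
    return masked, mapping


def unmask_parse(parse, mapping):
    """Insert the original named entities back into the placeholders of a parse."""
    for placeholder, entity in mapping.items():
        parse = parse.replace(placeholder, entity)
    return parse


# Entities are given by hand here; in the described system a separate
# neural network identifies them.
masked, mapping = mask_entities("Where are restaurants in Heidelberg?", ["Heidelberg"])
# A made-up parse, standing in for the output of the semantic parser.
predicted_parse = "area(keyval('name','@NE0@')),nwr(keyval('amenity','restaurant'))"
restored = unmask_parse(predicted_parse, mapping)
```

<p>The semantic parser then only ever sees the masked question, so it does not have to learn to copy rare place names token by token.</p>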
<p>With a semantic parser now available, we built <strong>a <a href="https://nlmaps.cl.uni-heidelberg.de/">graphical interface</a> for users to access the natural language interface to OSM</strong>. After a user enters a question, it is sent to the semantic parser, which produces a database query. The query is then executed against the database and both a textual and a graphical answer are displayed for the user. For example, in the picture below a user asked about cuisines in Heidelberg. A list of the various cuisines is displayed and clicking on a cuisine opens pop-up information boxes on relevant markers on the map below.</p>
<p><img src="/statnlpgroup/images/blog/2019-08_interface.jpg" alt="interface" /></p>
<p>If you want, <a href="https://nlmaps.cl.uni-heidelberg.de/">try out your own questions</a>!</p>
<h2 id="part-2-response-based-on-policy-learning">Part 2: Response-Based On-Policy Learning</h2>
<p>We now turn to the first approach to learn from feedback, response-based on-policy learning. The idea of response-based on-policy learning is to <strong>ground a model <script type="math/tex">\pi_w</script> in a downstream task for which gold targets are available</strong>. A great advantage of this approach is that feedback can be obtained for arbitrarily many outputs.</p>
<p>Concretely, we employ a ramp loss:</p>
<p><script type="math/tex">\mathcal{L}_{\mathrm{RAMP}} = - \left( \frac{1}{m} \sum_{t=1}^{m} \pi_w(y_t^+ \vert x_t) - \frac{1}{m} \sum_{t=1}^{m} \pi_w(y_t^- \vert x_t)\right)</script>.</p>
<p>In a ramp loss, a hope sequence <script type="math/tex">y^+</script> is encouraged, while a fear sequence <script type="math/tex">y^-</script> is discouraged. The specific instantiations are deferred to concrete tasks. But in general, a hope sequence has a high probability under the current model <script type="math/tex">\pi_w</script> while receiving a high feedback score <script type="math/tex">\delta</script>. In contrast, a fear sequence also obtains a high probability under the current model <script type="math/tex">\pi_w</script> but receives a low feedback score <script type="math/tex">\delta</script>.</p>
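<p>As a small numerical sketch of the ramp loss above: the function below takes the model probabilities of the hope and fear sequences for a batch of <em>m</em> inputs. The function name and the toy probabilities are assumptions for illustration; in practice the probabilities come from the model <script type="math/tex">\pi_w</script> and the loss is differentiated with respect to <script type="math/tex">w</script>.</p>

```python
def ramp_loss(hope_probs, fear_probs):
    """Ramp loss over a batch of m inputs.

    hope_probs[t] is pi_w(y_t^+ | x_t), fear_probs[t] is pi_w(y_t^- | x_t).
    Minimising the loss raises hope probabilities and lowers fear probabilities.
    """
    if not hope_probs or len(hope_probs) != len(fear_probs):
        raise ValueError("need one hope and one fear probability per input")
    m = len(hope_probs)
    return -(sum(hope_probs) / m - sum(fear_probs) / m)


# Toy batch with m = 2 inputs (made-up probabilities).
loss = ramp_loss([0.5, 0.25], [0.25, 0.25])
```

<p>For this toy batch the loss is -0.125. It becomes more negative as the model shifts probability mass from the fear sequences to the hope sequences.</p>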
<h3 id="multilingual-semantic-parsing-nlmaps">Multilingual Semantic Parsing: NLmaps</h3>
<p>For this task, we assume a semantic parser can transform English questions into OSM queries, but a user wants to ask questions in German. Thus, we first employ a machine translation system to translate the question from German into English. The goal is to adjust the machine translation system to work well in conjunction with the semantic parser. We use the ramp loss defined above and instantiate <script type="math/tex">\delta</script> to be 1 if a machine-translated question ultimately leads to the correct answer and 0 otherwise. For an overview of the setup, see the picture below.</p>
<p><img src="/statnlpgroup/images/blog/2019-08_multilingual_ramp.png" alt="multilingual_ramp" /></p>
<p>By using the feedback signal of the downstream semantic parsing task, we can improve a linear-model machine translation system to work better in conjunction with the semantic parser. The adjusted system achieves a higher answer-level F1 score by about 8 percentage points compared to the baseline system. This is the first example that demonstrates the effectiveness of grounding a model in a downstream task.</p>
<h3 id="question-answering-nlmaps-v2">Question-Answering: NLmaps v2</h3>
<p>For many question-answering tasks, it is easier to obtain gold answers rather than gold parses. Thus, it is possible to ground semantic parsers in gold answers and treat the parses as hidden. In this scenario, we can again employ the above defined ramp loss, where a semantic parse receives a feedback of <script type="math/tex">\delta=1</script> if the parse leads to a correct answer and <script type="math/tex">\delta=0</script> otherwise.</p>
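<p>With binary feedback, one plausible way to pick the hope and fear parses from a list of candidates is sketched below. This is an assumed instantiation for illustration only (the exact definitions are given in the thesis): the hope is the most probable candidate with <script type="math/tex">\delta=1</script> and the fear is the most probable candidate with <script type="math/tex">\delta=0</script>. The candidate list, its probabilities and the function name are hypothetical.</p>

```python
def select_hope_and_fear(candidates):
    """Pick hope and fear from (parse, model_probability, delta) triples.

    Assumed rule for this sketch: hope = most probable candidate whose
    answer is correct (delta == 1), fear = most probable candidate whose
    answer is wrong (delta == 0). Either may be None if no candidate qualifies.
    """
    correct = [c for c in candidates if c[2] == 1]
    wrong = [c for c in candidates if c[2] == 0]
    hope = max(correct, key=lambda c: c[1]) if correct else None
    fear = max(wrong, key=lambda c: c[1]) if wrong else None
    return hope, fear


# Hypothetical candidate parses for one question, e.g. from beam search.
candidates = [
    ("parse_a", 0.5, 0),   # most probable, but leads to a wrong answer
    ("parse_b", 0.3, 1),   # leads to the gold answer
    ("parse_c", 0.1, 1),
    ("parse_d", 0.05, 0),
]
hope, fear = select_hope_and_fear(candidates)
```

<p>Here the hope is <code>parse_b</code> and the fear is <code>parse_a</code>, so an update would move probability from the currently preferred wrong parse to the best correct one.</p>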
<p>On this task, we employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. This leads to a new loss function, called Ramp+T, that performs better (for more information, see Chapter 6 of the <a href="https://www.cl.uni-heidelberg.de/~lawrence/Lawrence18.pdf">thesis</a>).</p>
<p>For our experiment, we assume an initial model has been trained on 2k supervised question-parse pairs. For the remainder of the training data, only gold answers, but not gold parses, are available. Grounding the semantic parser in the gold answers with our new loss function, Ramp+T, allows us to outperform the baseline model by over 12 percentage points in answer-level F1 score.</p>
<p>We have now <strong>successfully applied response-based on-policy learning to two tasks</strong>. However, this approach ultimately <strong>requires gold targets of a downstream task</strong>, which can still be too expensive to obtain. This is the case in the OSM domain: for the question “How many hotels are there in Paris?”, we cannot expect a person to count all 951 hotels in a reasonable amount of time or without error. Consequently, we next look at an approach that requires no gold targets at all.</p>
<h2 id="part-3-counterfactual-off-policy-learning">Part 3: Counterfactual Off-Policy Learning</h2>
<p>In the second approach to learn from feedback, counterfactual off-policy learning, we assume that a model is deployed. Users interact with the model and corresponding feedback is logged, hence the deployed model is also called the logging model. Once enough feedback is collected, the collected log can be used to improve either the logging model or any other model. With this setup, we can learn from feedback and <strong>do not require any direct or indirect gold targets</strong>. For a graphical overview see the picture below.</p>
<p><img src="/statnlpgroup/images/blog/2019-08_loglearn_schema.png" alt="loglearn_schema" /></p>
<p>We update the model offline for several reasons:</p>
<ul>
<li>Safety: a deployed model that is updated could degenerate without notice, leading to a bad user experience.</li>
<li>Hyperparameters: offline it is possible to do hyperparameter testing.</li>
<li>Validation: the new model can be validated on a test set before it is deployed.</li>
</ul>
<p>While offline learning provides us with several crucial benefits, it is more challenging, because:</p>
<ul>
<li>Bandit setup: feedback is only given to one output.</li>
<li>Bias: the logged output is biased towards the choice made by the logging policy.</li>
</ul>
<p>We refer to the approach as <em>counterfactual</em> because we can ask the following counterfactual question: <em>How would another model have performed if it had been in control during logging?</em></p>
<p>To employ this approach to learn from feedback, we need to collect a log <script type="math/tex">D=\{(x_t,y_t,\delta_t)\}_{t=1}^n</script> with</p>
<ul>
<li><script type="math/tex">x_t</script>: input</li>
<li><script type="math/tex">y_t</script>: output from logging model <script type="math/tex">\mu</script></li>
<li><script type="math/tex">\delta_t</script>: feedback received from user</li>
</ul>
<p>Based on the log, counterfactual estimators can be defined to estimate the performance of another model <script type="math/tex">\pi_w</script>. The model <script type="math/tex">\pi_w</script> can then be updated via stochastic gradient descent (SGD), i.e. <script type="math/tex">w = w + \eta \nabla_w \mathcal{V}(\pi_w)</script>, where <script type="math/tex">\eta</script> is a suitably set learning rate.</p>
<p>In previous literature, it is assumed that outputs are sampled stochastically from the logging model. This leads to the Inverse Propensity Scoring (IPS) estimator, which can correct the bias introduced by the logging model via importance sampling:</p>
<p><script type="math/tex">\mathcal{V}_{\mathrm{IPS}}(\pi_w) = \frac{1}{n} \sum_{t=1}^n \delta_t \frac{\pi_w(y_t \vert x_t)}{\mu(y_t \vert x_t)}</script>.</p>
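<p>A minimal sketch of the IPS estimate follows. It assumes that each log entry also stores the logging probability <script type="math/tex">\mu(y_t \vert x_t)</script> as a fourth field, which the estimator needs; the function names, the toy log and the constant target model are made up for illustration.</p>

```python
def ips_estimate(log, target_prob):
    """Inverse Propensity Scoring estimate of a target model's value.

    log: list of (x, y, delta, mu_prob) tuples, where mu_prob is the
         probability the logging model assigned to the logged output y.
    target_prob: function (x, y) -> pi_w(y | x) for the model being evaluated.
    """
    total = 0.0
    for x, y, delta, mu_prob in log:
        # Importance weight: how much more (or less) likely the target
        # model is to produce y than the logging model was.
        total += delta * target_prob(x, y) / mu_prob
    return total / len(log)


# Toy log with two entries (made-up values).
log = [
    ("input 1", "output 1", 1.0, 0.5),
    ("input 2", "output 2", 0.0, 0.25),
]
# Hypothetical target model that assigns probability 0.25 to every output.
estimate = ips_estimate(log, lambda x, y: 0.25)
```

<p>For this toy log the estimate is 0.25: only the first entry contributes, with reward 1.0 and importance weight 0.25 / 0.5.</p>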
<p>However, <strong>sampling is dangerous because we are at risk of showing inferior outputs to a user</strong>, which would lead to a bad user experience. In machine translation, for example, there is a high risk that an output sampled from the model is not a correct translation. For this reason, we want to always select the most likely output. <strong>This leads to deterministic logging</strong> where <script type="math/tex">\mu(y_t \vert x_t)=1</script> for all instances. Consequently, the importance sampling correction is disabled. We refer to this estimator as Deterministic Propensity Matching (DPM):</p>
<p><script type="math/tex">\mathcal{V}_{\mathrm{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^n \delta_t \pi_w(y_t \vert x_t)</script>.</p>
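<p>To see what an update on this objective does, it helps to write out its gradient. Using the standard identity <script type="math/tex">\nabla_w \pi_w = \pi_w \nabla_w \log \pi_w</script>, and treating the logged feedback <script type="math/tex">\delta_t</script> as constant with respect to <script type="math/tex">w</script>, we get:</p>
<script type="math/tex; mode=display">\nabla_w \mathcal{V}_{\mathrm{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^n \delta_t \nabla_w \pi_w(y_t \vert x_t) = \frac{1}{n} \sum_{t=1}^n \delta_t \, \pi_w(y_t \vert x_t) \, \nabla_w \log \pi_w(y_t \vert x_t)</script>
<p>Each logged output is thus pushed up in proportion to its feedback and its current probability. With non-negative feedback, no logged output is ever pushed down, which is one way to see why additional variance and degeneracy controls are useful.</p>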
<p><strong>We would now like to find out if the deterministic DPM estimator can be used instead of the stochastic IPS estimator for sequence-to-sequence tasks in NLP.</strong></p>
<h3 id="machine-translation">Machine Translation</h3>
<p>To investigate whether DPM is feasible in comparison to IPS, we set up a machine translation experiment with simulated feedback. Given an out-of-domain MT system, the system translates in-domain data. To simulate feedback, we use the available gold references. This allows us to create stochastic and deterministic logs where both logs have the same feedback signal.</p>
<p>Both IPS and DPM suffer from high variance and can exhibit degenerative behaviour (see Chapter 7.2 in the <a href="https://www.cl.uni-heidelberg.de/~lawrence/Lawrence18.pdf">thesis</a>). To combat this, we add two control variates to each estimator, a multiplicative and an additive control variate (for an overview of control variates see <a href="http://cbl.eng.cam.ac.uk/pub/Intranet/MLG/ReadingGroup/VarianceReductionTechniquesForStochasticOptimization.pdf">the great slides by Matthew W. Hoffman</a>). This leads to the stochastic ĉDoubly Robust (ĉDR) and the deterministic ĉDoubly Controlled (ĉDC) estimator.</p>
<p>We run experiments on two separate datasets and in both cases the deterministic estimator performs as well as the stochastic one. From this, we conclude that <strong>deterministic logging is viable for sequence-to-sequence NLP tasks</strong> because there is enough implicit exploration at the word level (see Chapter 7.3.4 in the <a href="https://www.cl.uni-heidelberg.de/~lawrence/Lawrence18.pdf">thesis</a>).</p>
<p>However, we still need to show that counterfactual off-policy learning is possible for sequence-to-sequence NLP tasks when the feedback is obtained from real human users. We tackle this in the next section.</p>
<h3 id="question-answering-nlmaps-v2-1">Question-Answering: NLmaps v2</h3>
<p>We noted earlier that it is difficult for some question-answering domains to obtain gold answers, e.g. in the case of the OSM domain where we, for example, can’t expect a human to count 951 hotels. As the OSM query language is relatively unknown, it is also difficult to obtain gold parses. Thus, counterfactual off-policy learning, where no gold answers are required, is particularly suitable for the OSM domain.</p>
<p>However, given for example the question “How many hotels are there in Paris?” and a corresponding answer, e.g. “951” or “1,003”, a human still cannot judge whether “951” or “1,003” is correct. To solve this issue, we instead propose to make the underlying parse human understandable. We do this by automatically converting the parse into a set of statements that can easily be judged as right or wrong. You can see what this looks like for our example in the following picture:</p>
<p><img src="/statnlpgroup/images/blog/2019-08_feedback.jpg" alt="feedback" /></p>
<p>Once the form is filled out, we can map the individual statements back to the tokens in the parse that produced them. With this approach we collected feedback for 1<script type="math/tex">k</script> question-parse pairs from 9 humans.</p>
<p>For this task, we again employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. That is particularly well suited to our situation because the feedback form already collects feedback at the token level. This leads to the new objective, called DPM+T.</p>
<p>The DPM+T objective does not employ a control variate, but we would like to do so to reduce variance. The multiplicative control variate that we used previously, reweighting (<a href="https://papers.nips.cc/paper/5748-the-self-normalized-estimator-for-counterfactual-learning">Swaminathan and Joachims, 2015</a>), is not applicable to stochastic minibatch learning. To make it applicable, we modify this control variate, leading to a new control variate that we refer to as One-Step-Late reweighting (OSL). Together with the previous new objective, this leads to the combined objective, DPM+T+OSL (for more information, see Chapter 8 of the <a href="https://www.cl.uni-heidelberg.de/~lawrence/Lawrence18.pdf">thesis</a>).
<strong>DPM+T+OSL is the best objective both for learning from the 1<script type="math/tex">k</script> human feedback instances and for learning from a larger, but simulated log of 22<script type="math/tex">k</script> feedback instances.</strong></p>
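<p>The idea behind One-Step-Late reweighting can be sketched at the sequence level as follows. This is a simplified reading for illustration, not the thesis implementation: it ignores the token-level part (+T), assumes the normaliser is the average probability that a <em>previous</em> parameter setting assigns to a reference sample of the log, and uses made-up function names and toy values.</p>

```python
def osl_normaliser(reference_sample, prob_previous):
    """Average probability of logged outputs under previous parameters.

    reference_sample: list of (x, y, delta) log entries.
    prob_previous: function (x, y) -> probability under the earlier parameters.
    The result is treated as a constant when differentiating the objective.
    """
    total = sum(prob_previous(x, y) for x, y, _ in reference_sample)
    return total / len(reference_sample)


def dpm_osl_minibatch(minibatch, prob_current, normaliser):
    """Reweighted DPM objective on one minibatch (sequence level only)."""
    total = sum(delta * prob_current(x, y) for x, y, delta in minibatch)
    return total / (len(minibatch) * normaliser)


# Toy log of (input, logged output, feedback) entries (made-up values).
log = [("q1", "p1", 1.0), ("q2", "p2", 0.0)]
# Hypothetical models: the previous parameters give every logged output
# probability 0.5, the current parameters give 0.25.
normaliser = osl_normaliser(log, lambda x, y: 0.5)
objective = dpm_osl_minibatch(log, lambda x, y: 0.25, normaliser)
```

<p>Because the normaliser comes from earlier parameters, it does not have to be recomputed or differentiated within each minibatch step, while still rescaling the objective so that simply raising the probability of every logged output is less attractive.</p>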
<h3 id="comparison-of-both-learning-approaches">Comparison of both learning approaches</h3>
<p>Because we employ the same NLmaps task and the same neural network architecture for both approaches to learn from feedback, we can directly compare the two approaches.</p>
<p>Unsurprisingly, <strong>response-based learning outperforms counterfactual learning significantly because it has a better learning signal available</strong>. Because response-based learning has a downstream gold target at hand, it can obtain feedback for arbitrarily many model outputs. Counterfactual learning instead only has access to one model output and its feedback. Furthermore, that model output is biased by the logging policy.</p>
<p>Ultimately, the <strong>choice between response-based and counterfactual learning reduces to how expensive it is to obtain gold targets</strong>. For example, for the OSM domain, it is impractical to obtain gold parses as well as gold answers because the parses can only be written by a handful of people and the answers are too cumbersome for humans to derive. In such a situation, obtaining feedback on model outputs from human users is a viable alternative.</p>
<p>If the base model is good enough, this feedback can directly be collected while real users are interacting with the system. Otherwise, another option would be to recruit human workers to provide the needed feedback.</p>
<p>So in conclusion: <strong>counterfactual learning should be chosen if gold targets are impossible, too time consuming or too expensive to obtain, whereas feedback for model outputs can be collected easily. Otherwise, response-based learning is the better approach because the available gold targets offer a stronger learning signal.</strong> For an overview of this, also see the following diagram:</p>
<p><img src="/statnlpgroup/images/blog/2019-08_decision.jpg" alt="decision" /></p>
<h2 id="conclusion"><a name="conclusion">Conclusion</a></h2>
<p>It is a good idea to explore how to learn from feedback given to model outputs for several reasons, the primary one being that the <strong>collection of direct gold targets might be too expensive</strong>.</p>
<p>In my thesis, I explored <strong>two separate approaches to learn from feedback</strong>, response-based and counterfactual learning. Response-based learning assumes that indirect gold targets are available. Counterfactual learning does not require gold targets and instead saves feedback given by humans interacting with a deployed system in a log.</p>
<p>If (indirect) gold targets can be obtained, <strong>response-based learning is the more promising approach</strong> because the gold targets offer a stronger learning signal. However, for <strong>situations where it is not possible to collect direct or indirect gold targets, counterfactual learning offers a viable alternative</strong>.</p>
<p>Besides exploring how to learn from feedback, it was important to me during my PhD project to keep a concrete user application in mind. To this end, I developed a <a href="https://nlmaps.cl.uni-heidelberg.de/">natural language interface to OpenStreetMap (OSM)</a>.</p>
<p>My PhD project was a long, but very rewarding journey. I learnt so much and got to join a great NLP community. Special thanks go to my supervisor, Stefan Riezler, who always encouraged my ideas and guided me to the path that led to my thesis. I also want to thank all my colleagues who were always willing to listen and offer suggestions.</p>
<p>If you enjoyed this post and want to discuss anything further, feel free to reach out to me via <a href="mailto:Carolin.Lawrence@neclab.eu">e-mail</a> or <a href="https://twitter.com/caro__lawrence" target="_blank">twitter</a>.</p>
<p><strong>More information can be found in the <a href="https://www.cl.uni-heidelberg.de/~lawrence/Lawrence18.pdf">thesis</a>.</strong></p>
<p><strong>Acknowledgment: Thanks to Stefan Riezler and Mayumi Ohta for their valuable feedback to improve this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>If you want to cite this blogpost, use this
<a href="" data-target="collapse__key" onclick="toggleBib(this);return false;">
bib<span class="bib__close"><i class="fas fa-chevron-up"></i></span><span class="bib__open"><i class="fas fa-chevron-down"></i></span>
</a>.</strong></p>
<div class="bib__raw-bibtex bib__hide" id="collapse__key">
<pre>@misc{Lawrence:19,
author = {Lawrence, Carolin},
title = {Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview},
journal = {StatNLP HD Blog},
type = {Blog},
number = {August},
year = {2019},
howpublished = {\url{https://www.cl.uni-heidelberg.de/statnlpgroup/blog/lff/}}
}
</pre>
</div>
<p>Carolin. This post presents a summary of my PhD thesis. I explored how to learn from feedback given to model outputs when the collection of direct supervision signals is too costly. I also built a natural language interface to the geographical database OpenStreetMap.</p>
<p>The Real Challenge of Real-World Reinforcement Learning: The Human Factor (2019-07-26, http://www.cl.uni-heidelberg.de/statnlpgroup/blog/HRL)</p>
<blockquote>
<p>The full potential of reinforcement learning requires reinforcement learning agents to be embedded into the flow of real-world experience, where they act, explore, and learn in our world, not just in their worlds. (Sutton & Barto (2018). Reinforcement Learning. An Introduction. 2nd edition)</p>
</blockquote>
<p>Recent well-recognized research has shown that artificial intelligence agents can achieve human-like or even superhuman performance in playing Atari games (<a href="https://www.nature.com/articles/nature14236">Mnih et al. 2015</a>), or the game of Go (<a href="https://www.nature.com/articles/nature16961">Silver et al. 2016</a>), without human supervision (<a href="https://www.nature.com/articles/nature24270">Silver et al. 2017</a>), but instead using reinforcement learning techniques for many rounds of self-play.
This is a huge achievement in artificial intelligence research, opening the doors for applications where supervised learning is (too) costly, and with ramifications for many other application areas beyond gaming. The question arises how to transfer the superhuman achievements of RL agents under clean room conditions like gaming (where reward signals are well-defined and abundant) to real-world environments with all their shortcomings, first and foremost, the shortcomings of human teachers (who obviously would not pass the Turing test, as indicated in the comic below).</p>
<figure class="align-center" style="max-width: 15cm;">
<img src="/statnlpgroup/images/blog/ProfessorsTuringTest.gif" />
</figure>
<h2 id="the-human-factor-in-real-world-rl-for-natural-language-processing">The human factor in real-world RL for natural language processing</h2>
<p>Let us have a look at human learning scenarios, for example, natural language translation: A human student of translation and interpretation studies has to learn how to produce correct translations from a mix of feedback types. The human teacher will provide supervision signals in the form of gold standard translations in some cases. However, in most cases the student has to learn from weaker teacher feedback that signals how well the student accomplished the task, without knowing what would have happened if the student had produced a different translation, nor what the correct translation should look like. In addition, the best students will become like teachers in that they acquire a repertoire of strategies to self-regulate their learning process (<a href="https://journals.sagepub.com/doi/full/10.3102/003465430298487">Hattie and Timperley 2007</a>).</p>
<p>Now, if our goal is to build an artificial intelligence agent that learns to translate like a human student, in interaction with a professional human translator acting as the teacher, we see the same pattern of a cost-effectiveness tradeoff: The human translator will not want to provide a supervision signal in the form of a correct translation as feedback to every translation produced by the agent, even if this signal is the most informative. Rather, in some cases weaker feedback signals on the quality of the system output, or on parts of it, are a more efficient way of student-teacher interaction. Another scenario is users of online translation systems: They act as consumers - sometimes they might give a feedback signal, but rarely a fully correct translation.</p>
<p>We also see a similar pattern in the quality of the teacher’s feedback signal when training a human and when training an agent: The human teacher of the human translation student and the professional translator acting as a human teacher of the artificial intelligence agent are both human, so their feedback signals can be ambiguous, misdirected, or sparse; in short, only human (see the comic above). This is in stark contrast to the scenarios in which the success stories of RL have been written, namely gaming. In these environments reward signals are unambiguous, accurate, and plentiful. One might say that the RL agents playing games against humans received an unfair advantage of an artificial environment that suits their capabilities. However, in order to replicate these success stories for RL in scenarios with learning from human feedback, we should not belittle these successes, but learn from them: The goal should be to give the RL agents that learn from human feedback any possible advantage to succeed in this difficult learning scenario. For this we have to better understand what the real challenges of learning from human feedback consist of.</p>
<h2 id="disclaimer">Disclaimer</h2>
<p>In contrast to previous work on learning from human reinforcement signals (see, for example, <a href="https://dl.acm.org/citation.cfm?id=1597738">Knox and Stone</a>, <a href="https://arxiv.org/abs/1706.03741">Christiano et al. 2017</a>, <a href="https://arxiv.org/abs/1811.07871">Leike et al. 2018</a>), our scenario is not one where human knowledge is used to reduce the sample complexity and thus to speed up the learning process of the system, but one where no other reward signals than human feedback are available for interactive learning. This scenario applies to many personalization scenarios where a system that is pre-trained in a supervised fashion is adapted and improved in an interactive learning setup from feedback of the human user. Examples are online advertising and machine translation, the latter of which we will focus on here.</p>
<p>Recent work (<a href="https://arxiv.org/abs/1904.12901v1">Dulac-Arnold et al. 2019</a>) has recognized that the poorly defined realities of real-world systems are hampering the progress of real-world reinforcement learning. They address, amongst others, issues such as off-line learning, limited exploration, high-dimensional action spaces, or unspecified reward functions. These challenges are important in RL for control systems or robots grounded in the physical world; however, they severely underestimate the human factor in interactive learning. We will use their paper as a foil to address several recognized challenges in real-world RL.</p>
<h2 id="counterfactual-learning-under-deterministic-logging">Counterfactual learning under deterministic logging</h2>
<p>One of the issues addressed in <a href="https://arxiv.org/abs/1904.12901v1">Dulac-Arnold et al. 2019</a> is the need for off-line or off-policy RL in applications where systems cannot be updated online. Online learning is unrealistic in commercial settings due to latency requirements and the desire for offline testing of system updates before deployment. A natural solution would be to exploit counterfactual learning that reuses logged interaction data where the predictions have been made by a historic system different from the target system.</p>
<figure class="align-center" style="max-width: 7cm;">
<img src="/statnlpgroup/images/blog/loglearn_schema_no_observation.png" />
</figure>
<p>However, both online learning and offline learning from logged data are plagued by the problem that <strong>exploration is prohibitive in commercial systems since it means showing inferior outputs to users</strong>. This effectively results in deterministic logging policies that lack explicit exploration, making an application of standard off-policy methods questionable. For example, techniques such as inverse propensity scoring (<a href="https://www.jstor.org/stable/2335942">Rosenbaum and Rubin 1983</a>), doubly-robust estimation (<a href="https://arxiv.org/abs/1103.4601">Dudik et al. 2011</a>), or weighted importance sampling (<a href="https://www.semanticscholar.org/paper/Eligibility-Traces-for-Off-Policy-Policy-Evaluation-Precup-Sutton/44fe9e7f22f8986d48e3753543792d28b0494db0">Precup et al. 2000</a>, <a href="https://arxiv.org/abs/1511.03722">Jiang and Li 2016</a>, <a href="https://arxiv.org/abs/1604.00923">Thomas and Brunskill 2016</a>) all rely on sufficient exploration of the output space by the logging system as a prerequisite for counterfactual learning. In fact, <a href="https://dl.acm.org/citation.cfm?id=1390223">Langford et al. 2008</a> and <a href="https://arxiv.org/abs/1003.0120">Strehl et al. 2010</a> even give impossibility results for exploration-free counterfactual learning.</p>
<p><strong>Clearly, standard off-policy learning does not apply when commercial systems interact safely, i.e., deterministically with human users!</strong></p>
<p>So what to do? One solution is to hope for <strong>implicit exploration due to input or context variability</strong>. This has been observed for the case of online advertising (<a href="https://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling">Chapelle and Li 2012</a>) and investigated theoretically (<a href="https://arxiv.org/abs/1704.09011v5">Bastani et al. 2017</a>). However, natural exploration is something inherent in the data, not something machine learning can optimize for.</p>
<p>Another solution is to consider concrete cases of degenerate behavior in estimation from deterministically logged data, and find solutions that might circumvent the impossibility theorems. One such degenerate behavior is that the empirical reward over the data log can be maximized by setting the probability of all logged data to 1. However, it is clearly undesirable to increase the probability of low reward examples (<a href="https://papers.nips.cc/paper/5748-the-self-normalized-estimator-for-counterfactual-learning">Swaminathan and Joachims 2015</a>, <a href="https://arxiv.org/abs/1711.08621">Lawrence et al. 2017a</a>, <a href="https://arxiv.org/abs/1707.09118">Lawrence et al. 2017b</a>). A solution to the problem, called <strong>deterministic propensity matching</strong>, has been presented by <a href="https://arxiv.org/abs/1811.12239">Lawrence and Riezler 2018a</a>, <a href="https://arxiv.org/abs/1805.01252">Lawrence and Riezler 2018b</a> and has been tested with real human feedback in a semantic parsing scenario. The central idea is as follows:
Consider logged data <script type="math/tex">D = \{(\mathbf{x}^{(h)}, \mathbf{y}^{(h)}, r(\mathbf{y}^{(h)}))\} ^H_{h=1}</script>, where <script type="math/tex">\mathbf{y}^{(h)}</script> is sampled from a logging system <script type="math/tex">\mu(\mathbf{y}^{(h)}|\mathbf{x}^{(h)})</script>, and the reward <script type="math/tex">r(\mathbf{y}^{(h)}) \in [0,1]</script> is obtained from a human user. One possible objective for off-line learning under deterministic logging is to maximize the expected reward of the logged data</p>
<script type="math/tex; mode=display">L(\theta) = \frac{1}{H}\sum_{h=1}^H r(\mathbf{y}^{(h)}) \, \bar{p}_{\theta,\theta'}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)}),</script>
<p>where a multiplicative control variate (<a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwi898Dyw8vjAhUfSBUIHUqIAAIQFjAAegQIAhAC&url=https%3A%2F%2Fgalton.uchicago.edu%2Ftechreports%2Ftr348.pdf&usg=AOvVaw3uHX3xRujlZSV05izCD-1X">Kong 1992</a>) is used for reweighting, evaluated one-step-late at <script type="math/tex">\theta'</script> from some previous iteration (for efficient gradient calculation), where</p>
<script type="math/tex; mode=display">\bar{p}_{ \theta,\theta'}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)}) = \frac{p_{ \theta}(\mathbf{y}^{(h)}|\mathbf{x}^{(h)})}{\sum_{b=1}^B p_{ \theta'}(\mathbf{y}^{(b)}|\mathbf{x}^{(b)})}.</script>
<p>The effect of this self-normalization is to prevent the probability of low-reward data from being increased during learning, because this would take away probability mass from higher-reward outputs. This introduces a bias in the estimator (which decreases as <script type="math/tex">B</script> increases). However, it makes learning under deterministic logging feasible, thus giving the RL agent an edge in learning in an environment that has been deemed impossible in the literature. See also <a href="/statnlpgroup/blog/parsing_when_gold_answers_unattainable/">Carolin’s blog</a> describing the semantic parsing scenario.</p>
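<p>To make the reweighted objective concrete, here is a minimal Python sketch. It rests on a simplifying assumption: the objective is evaluated on a single minibatch, so the average and the normalizer range over the same <script type="math/tex">B</script> examples. The function name and the toy numbers are illustrative and not taken from the papers. In an actual implementation the probabilities would come from the model, and the normalizer would be held constant during differentiation.</p>

```python
def self_normalized_objective(rewards, probs_theta, probs_theta_prev):
    """Value of the reweighted objective on one minibatch.

    rewards:          r(y) in [0, 1] for each logged output
    probs_theta:      p_theta(y | x) under the current parameters
    probs_theta_prev: p_theta'(y | x) under parameters from a previous iteration
    """
    # One-step-late normalizer: computed with the older parameters theta',
    # so it acts as a constant when differentiating with respect to theta.
    normalizer = sum(probs_theta_prev)
    batch_size = len(rewards)
    total = sum(r * p / normalizer for r, p in zip(rewards, probs_theta))
    return total / batch_size


# Toy minibatch (assumed values): one high-reward and one zero-reward output.
value = self_normalized_objective(
    rewards=[1.0, 0.0],
    probs_theta=[0.4, 0.2],
    probs_theta_prev=[0.5, 0.5],
)
print(value)
```

<p>Because the normalizer depends only on the older parameters, the gradient with respect to the current parameters is just a rescaled version of the unnormalized gradient, which is what makes the one-step-late variant cheap to compute.</p>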
<h2 id="learning-reward-estimators-from-human-bandit-feedback">Learning reward estimators from human bandit feedback</h2>
<p>Other issues addressed prominently in <a href="https://arxiv.org/abs/1904.12901v1">Dulac-Arnold et al. 2019</a> are the problems of learning from limited samples, in high dimensional action spaces, with unspecified reward functions. This is a concise description of the learning scenario in interactive machine translation: Firstly, it is <strong>unrealistic to expect anything other than bandit feedback from a human user using a commercial machine translation system</strong>. That is, a user of a machine translation system will only provide a reward signal to one deterministically produced best system output, and cannot be expected to rate a multitude of translations for the same input. Providers of commercial machine translation systems realize this and provide non-intrusive interfaces for user feedback that allow users to post-edit translations (negative signal), or to copy and/or share the translation without changes (positive signal). Furthermore, human judgements on the quality of full translations need to cover an <strong>exponential output space</strong>, while the <strong>notion of translation quality is not a well-defined function</strong> to start with: In general every input sentence has a multitude of correct translations, each of which humans may judge differently, depending on many contextual and personal factors.</p>
<p>Surprisingly, the question of how to give the RL agent an advantage in learning from real-world human feedback has been scarcely researched. The suggestions in <a href="https://arxiv.org/abs/1904.12901v1">Dulac-Arnold et al. 2019</a> may seem straightforward - warm-starting agents to decrease sample complexity or using inverse reinforcement learning to recover reward functions from demonstrations - but they require additional supervision signals that RL was supposed to alleviate. Furthermore, when it comes to the question which type of human feedback is most beneficial for training an RL agent, one finds a lot of blanket statements referring to the advantages of pairwise comparisons to produce a scale (<a href="https://psycnet.apa.org/record/1928-00527-001">Thurstone 1927</a>), however, without providing any empirical evidence.</p>
<p>An exception is the work of <a href="https://arxiv.org/abs/1805.10627">Kreutzer et al. 2018</a>. This work is one of the first to investigate the question of which type of human feedback - pairwise judgements or cardinal feedback on a 5-point scale - can be given most reliably by human teachers, and which type of feedback allows learning reward estimators that best approximate human rewards and can be best integrated into an end-to-end RL task. Let’s look at example interfaces for 5-point feedback and pairwise judgements:</p>
<figure class="align-center" style="max-width: 12cm;">
<img src="/statnlpgroup/images/blog/interface-highlight.png" />
<img src="/statnlpgroup/images/blog/interface-pw2-highlight.png" />
</figure>
<p>Contrary to common belief, <strong>inter-rater reliability was higher for 5-point ratings</strong> (Krippendorff’s <script type="math/tex">\alpha =0.51</script>) than for pairwise judgements (<script type="math/tex">\alpha=0.39</script>) in the study of <a href="https://arxiv.org/abs/1805.10627">Kreutzer et al. 2018</a>. They explain this by the possibility of standardizing cardinal judgements for each rater to get rid of individual biases, and by filtering out raters with low intra-rater reliability. The main problem for pairwise judgements was distinguishing between similarly good or bad translations; such pairs could be filtered out to improve intra-rater reliability, yielding the final inter-rater reliability given above.</p>
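<p>Standardizing cardinal judgements per rater can be sketched as follows. This assumes that standardization means per-rater z-scoring with the population standard deviation; the exact procedure used in the study may differ, and the function name and data layout are hypothetical.</p>

```python
from statistics import mean, pstdev


def standardize_per_rater(ratings_by_rater):
    """Z-score each rater's ratings using that rater's own mean and spread.

    ratings_by_rater: dict mapping a rater id to a list of 5-point ratings.
    Returns a dict with the same keys and standardized ratings.
    """
    standardized = {}
    for rater, ratings in ratings_by_rater.items():
        mu = mean(ratings)
        sigma = pstdev(ratings)
        if sigma == 0:
            # A rater who always gives the same score carries no ranking
            # information; map all of their ratings to zero.
            standardized[rater] = [0.0 for _ in ratings]
        else:
            standardized[rater] = [(r - mu) / sigma for r in ratings]
    return standardized


# A strict rater and a lenient rater who agree on the ranking (assumed data).
print(standardize_per_rater({"strict": [1, 2, 3], "lenient": [3, 4, 5]}))
```

<p>After standardization, the strict and the lenient rater in the toy example produce identical scores, which illustrates how individual biases in the use of the scale are removed.</p>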
<p>Furthermore, when training reward estimators on judgments collected for 800 translations, they measured learnability by the correlation between estimated rewards and translation edit rate to human reference translations. They found that <strong>learnability was better for a regression model trained on 5-point feedback than for a Bradley-Terry model trained on pairwise rankings</strong> (as recently used for RL from human preferences by <a href="https://arxiv.org/abs/1706.03741">Christiano et al. 2017</a>).</p>
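<p>The difference between the two kinds of reward estimator shows up in their per-example training losses. The sketch below uses the textbook forms: squared error for regression on a cardinal rating, and the negative log-likelihood of the Bradley-Terry model for a pairwise preference. It is not the exact setup of the cited papers, and the function names are illustrative.</p>

```python
import math


def regression_loss(estimated_reward, human_rating):
    """Squared error between the estimated reward and a (standardized) rating."""
    return (estimated_reward - human_rating) ** 2


def bradley_terry_loss(reward_preferred, reward_other):
    """Negative log-likelihood that the preferred translation wins.

    Under the Bradley-Terry model, the probability that the preferred
    translation wins is sigmoid(reward_preferred - reward_other).
    """
    difference = reward_preferred - reward_other
    # -log(sigmoid(d)) written as log(1 + exp(-d))
    return math.log1p(math.exp(-difference))


print(regression_loss(0.5, 1.0))
print(bradley_terry_loss(2.0, 0.0))
```

<p>Note that the pairwise loss only constrains the difference between two estimated rewards, while the regression loss ties each estimate to an absolute value on the rating scale.</p>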
<p>Finally, and most importantly, when <strong>integrating reward estimators into an end-to-end RL task, they found that one can improve a neural machine translation system by more than 1 BLEU point by a reward estimator trained on only 800 cardinal user judgements</strong>. This is not only a promising result pointing in the direction in which future research for real-world RL could happen, but it also solves all three of the above mentioned challenges of <a href="https://arxiv.org/abs/1904.12901v1">Dulac-Arnold et al. 2019</a> (limited samples, high dimensional action spaces, unspecified reward functions) in one approach: Reward estimators can be trained on very small datasets, and then be integrated as reward functions over high dimensional action spaces. The idea is to tackle the arguably simpler problem of learning a reward estimator from human feedback first, then provide unlimited learned feedback to generalize to unseen outputs in off-policy RL.</p>
<h2 id="further-avenues-self-regulated-interactive-learning">Further avenues: Self-regulated interactive learning</h2>
<p>As mentioned earlier, human students have to be able to learn in situations where the most informative learning signals are the sparsest. This is because teacher feedback comes at a cost, so that the most precious feedback of gold standard outputs has to be requested economically. Furthermore, students have to learn how to self-regulate their learning process and learn when to seek help and which kind of help to seek. This is different from classic RL games, where the cost of feedback is negligible (we can simulate games forever); this is not realistic in the real world, where exploration in particular can get very costly (and dangerous).</p>
<p><strong>Learning to self-regulate</strong> is a new research direction that tries to <strong>equip an artificial intelligence agent with a decision-making ability that is traditionally hard for humans - balancing cost and effect of learning from different types of feedback,</strong> including full supervision by teacher demonstration or correction, weak supervision in the form of positive or negative rewards for student predictions, or a self-supervision signal generated by the student.</p>
<figure class="align-center" style="max-width: 11cm;">
<img src="/statnlpgroup/images/blog/Active_RL.png" />
</figure>
<p><a href="https://arxiv.org/abs/1907.05190">Kreutzer and Riezler 2019</a> have shown how to cast self-regulation as a learning-to-learn problem that solves the above problem by making the agent aware of and manage the cost-reward trade-off. They find in simulation experiments on interactive neural machine translation that the self-regulator is a powerful alternative to uncertainty-based active learning (<a href="https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwi3546ZrtDjAhWRr6QKHTJ7AgMQFjAAegQIBRAC&url=https%3A%2F%2Fwww.biostat.wisc.edu%2F~craven%2Fpapers%2Fsettles.emnlp08.pdf&usg=AOvVaw2hhRs69DCAsD2fv79JuL6b">Settles and Craven 2008</a>), and discovers an <script type="math/tex">\epsilon</script>-greedy strategy for the optimal cost-quality trade-off by mixing
different feedback types including corrections, error markups, and self-supervision. Their simulation scenario of course abstracts away from certain confounding variables to be expected in real-life interactive machine learning, however, all of these are interesting directions for new research on real-life RL with human teachers.</p>
<h2 id="the-appeal-of-rl-from-human-feedback">The appeal of RL from human feedback</h2>
<p>I tried to show that some of the challenges in real-world RL originate from the human teachers who have been considered a help in previous work (<a href="https://dl.acm.org/citation.cfm?id=1597738">Knox and Stone</a>, <a href="https://arxiv.org/abs/1706.03741">Christiano et al. 2017</a>, <a href="https://arxiv.org/abs/1811.07871">Leike et al. 2018</a>): In situations where only the feedback of a human user is available to personalize and adapt an artificial intelligence agent, the standard tricks of memorizing large amounts of labels in supervised learning, or training in unlimited rounds of self-play with cost-free and accurate rewards in RL, won’t do the job. If we want to move RL into the uncharted territories of training artificial intelligence agents from feedback of cost-aware, unfathomable human teachers, we need to make sure the agent does not depend on massive exploration, and we have to learn great models of human feedback. It will be interesting to see how and what artificial intelligence agents learn in the same information-deprived situations that human students have to deal with, and hopefully, it will lead to artificial intelligence agents that can support humans by smoothly adapting to their needs.</p>
<p><strong>Acknowledgment: Thanks to Julia Kreutzer and Carolin Lawrence for our joint work and their valuable feedback on this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of his affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this
<a href="" data-target="collapse__key" onclick="toggleBib(this);return false;">
bib<span class="bib__close"><i class="fas fa-chevron-up"></i></span><span class="bib__open"><i class="fas fa-chevron-down"></i></span>
</a>.</strong></p>
<div class="bib__raw-bibtex bib__hide" id="collapse__key">
<pre>@misc{riezler:hrl:19,
author = {Riezler, Stefan},
title = {The Real Challenges of Real-World Reinforcement Learning: The Human Factor},
journal = {StatNLP HD Blog},
type = {Blog},
number = {July},
year = {2019},
howpublished = {\url{https://www.cl.uni-heidelberg.de/statnlpgroup/blog/hrl/}}
}
</pre>
</div>StefanHow can we give RL agents that learn from human feedback a possible advantage to succeed in this difficult learning scenario?Counterfactual Learning of Semantic Parsers When Even Gold Answers Are Unattainable2019-01-14T00:00:00+00:002019-01-14T00:00:00+00:00http://www.cl.uni-heidelberg.de/statnlpgroup/blog/parsing-overview<!--# Counterfactual Learning of Semantic Parsers When Even Gold Answers Are Unattainable-->
<p>In semantic parsing, natural language questions are mapped to semantic parses. A semantic parse can be executed against a database to obtain an answer. This answer can then be presented to a user.</p>
<p>Semantic parsers for question-answering can be employed in virtual personal assistants, which have been on the rise in recent years. Because such assistants are expected to help with an increasing number of tasks, we need to explore the best possible options to efficiently and effectively set up a parser for a new domain, to adapt it to specific user needs and to generally ensure that it improves.</p>
<p>However, obtaining labelled data can be challenging. In this post, we first consider the different possible supervision signals that can be used to train a semantic parser. This influences which objectives can be used for training, which we explore in the second part.</p>
<h2 id="supervision-signal">Supervision Signal</h2>
<h3 id="question-parse-pairs">Question-Parse Pairs</h3>
<p>To train a semantic parser, direct supervision means the collection of question-parse pairs. This can be difficult if the parse language is only understood by expert users. One option is to ensure that the parse language is as widely known as possible, e.g. by choosing SQL (<a href="http://aclweb.org/anthology/P17-1089">Iyer et al., 2017</a>). However, even in the case of SQL, experts are required for the annotation, which can quickly get very expensive.</p>
<h3 id="question-answer-pairs">Question-Answer Pairs</h3>
<p>An alternative option is to employ a weaker supervision signal. Collecting question-answer pairs is easier for many domains (<a href="www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al., 2013</a>; <a href="http://www.aclweb.org/anthology/D14-1070">Iyyer et al., 2014</a>; <a href="http://www.aclweb.org/anthology/D15-1237">Yang et al., 2015</a>; <em>inter alia</em>) and can typically be done by non-experts.</p>
<p>However, the weaker supervision signal from question-answer pairs presents a harder learning task. While the gold answer is known, it remains unclear which parse will lead to the gold answer. During training, the parser has to explore the output space to find a parse that executes to the correct gold answer. This search can be difficult as the output space is large. Furthermore, instead of finding a parse that represents the correct meaning of the question, one might find a <strong>spurious</strong> parse instead. Such a parse happens to execute to the gold answer, but conveys the wrong meaning. This hampers generalisation.</p>
<p>For example, assume we have the question “Are there any bars?” and instead of mapping “bar” to the logical form for “bar”, the parser maps it to the logical form of “restaurant” instead. If the answer for both “Are there any bars?” and “Are there any restaurants?” is “Yes”, then the wrong logical form “restaurant” for the question “Are there any bars?” will lead to the correct answer. The parser has now wrongly learnt to map “bar” to the logical form “restaurant”, and for other questions, such as “Where is the closest bar?”, it will now return the closest restaurant instead.</p>
<h3 id="comparison-question-parse-vs-question-answer-pairs">Comparison: Question-Parse vs. Question-Answer Pairs</h3>
<p><a href="http://www.aclweb.org/anthology/P16-2033">Yih et al., 2016</a> investigated the cost and benefit of obtaining question-parse pairs compared to collecting question-answer pairs. For this, they use the WebQuestions corpus (<a href="www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al., 2013</a>), which is based on the <a href="https://developers.google.com/freebase/">Freebase Database</a>. The corpus was originally collected with the help of non-expert crowd-source workers in the form of question-answer pairs. <a href="http://www.aclweb.org/anthology/P16-2033">Yih et al., 2016</a> annotate each question in the corpus with corresponding gold parses. To ease the annotation, they designed a simple user interface and hired experts familiar with Freebase.</p>
<p>Next, they compared a system trained on question-parse pairs to a system trained on question-answer pairs. In their experiments, they were able to show three, in part surprising, results:</p>
<ol>
<li>
<p>The model trained on question-parse pairs outperforms the model trained on question-answer pairs by over 5 percentage points in answer accuracy.</p>
</li>
<li>
<p>Answer annotation by crowd-source workers is often incorrect: in their evaluation, it was incorrect 34% of the time.</p>
</li>
<li>
<p>With an easy to use interface, experts can write the correct semantic parse faster than they can retrieve the correct answer.</p>
</li>
</ol>
<p>Observation 1. does not come as a surprise as question-parse pairs offer a stronger learning signal. But both 2. and 3. are surprising. However, as noted previously, hiring experts to annotate gold parses can be expensive.</p>
<p>A further problem arises for domains where it is not easy to collect gold answers, for example when answers are open-ended lists, fuzzily defined or very large.</p>
<p>This is the case, for example, in the domain of geographical question-answering using the OpenStreetMap database. Here, the underlying parse language is only known to a few expert users, which makes the collection of gold parses particularly difficult. Furthermore, it is often impossible to collect gold answers because in many cases the gold answer set is too large or fuzzily defined (e.g. when searching for objects “near” another one) to be obtained in a reasonable amount of time or without error.</p>
<h3 id="question-feedback-pairs">Question-Feedback Pairs</h3>
<p>In cases where both the collection of gold parses and gold answers is infeasible, we need to obtain a learning signal from other sources. One option is to obtain feedback from users while they are interacting with the system (<a href="http://aclweb.org/anthology/P18-1169">Lawrence & Riezler 2018</a>).</p>
<p>For this, a baseline semantic parser is trained on a small amount of question-parse pairs. This parser can be used to parse further questions for which neither gold parses nor gold answers exist. The parse suggested by the baseline can then be automatically transformed into a set of human-understandable statements. Human users can easily judge each statement as correct or incorrect. This feedback can be used to further improve the parser.</p>
<p>For example, below is a question and the statements automatically generated from the corresponding parse.</p>
<p><img src="/statnlpgroup/images/blog/2018-11-14_parsing_overview.png" alt="" /></p>
<p>With the filled in form, we know which parts of the parse are wrong.</p>
<p>This allows us to go further than just promoting correct parses. For each statement, we are able to map it back to the tokens in the parse that produced it. This allows us to learn from partially correct parses, where we only promote the tokens associated with correct statements.</p>
<h2 id="objectives">Objectives</h2>
<p>The collected data decides which objectives can be applied during training. Below we give an overview of various objectives, which data they require and what their advantages and disadvantages are.</p>
<p>First off, here is some general notation:</p>
<ul>
<li><script type="math/tex">\pi_w</script>: neural network with parameters <script type="math/tex">w</script></li>
<li><script type="math/tex">x = x_1, x_2, \dots x_{\mid x\mid }</script>: input question</li>
<li><script type="math/tex">y = y_1, y_2, \dots y_{\mid y\mid }</script>: output parse</li>
<li><script type="math/tex">\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{\mid \bar{y}\mid }</script>: gold parse</li>
<li><script type="math/tex">\bar{a}</script>: gold answer</li>
</ul>
<p>We define an objective in terms of a loss function <script type="math/tex">\mathcal{L}</script>. For training, we differentiate it with respect to the model’s parameters <script type="math/tex">w</script> to make (stochastic) gradient descent updates, <script type="math/tex">w = w - \eta \nabla_w \mathcal{L}</script>, where <script type="math/tex">\eta</script> is a suitable learning rate.</p>
<h3 id="question-parse-pairs-maximum-likelihood-estimation-mle">Question-Parse Pairs: Maximum Likelihood Estimation (MLE)</h3>
<p>Neural networks are typically trained using MLE, where the probability of a gold parse <script type="math/tex">\bar{y}</script> is raised for a given question <script type="math/tex">x</script> (e.g. <a href="http://www.aclweb.org/anthology/P/P16/P16-1004.pdf">Dong & Lapata, 2016</a> or <a href="http://www.aclweb.org/anthology/P16-1002">Jia & Liang, 2016</a>). The objective is defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}_{MLE} = - \sum_{j=1}^{\mid \bar{y}\mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x), %]]></script>
<p>where <script type="math/tex">% <![CDATA[
\bar{y}_{<j} = \bar{y}_{1}, \bar{y}_{2}, \dots \bar{y}_{j-1}. %]]></script></p>
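<p>As a small illustration of the MLE objective above, the following Python sketch computes the loss from the probabilities that a model assigns to each gold token given the gold prefix and the question. The probabilities are made-up numbers standing in for the outputs of a neural network, and the function name is hypothetical.</p>

```python
import math


def mle_loss(gold_token_probs):
    """Negative log-likelihood of a gold parse.

    gold_token_probs[j] is the model probability of the j-th gold token,
    conditioned on the preceding gold tokens and the input question.
    """
    return -sum(math.log(p) for p in gold_token_probs)


# A three-token gold parse; the second token is the one the model is unsure about.
print(mle_loss([0.9, 0.3, 0.8]))
```

<p>The loss is zero only if every gold token receives probability one, so minimizing it pushes probability mass towards the single annotated parse.</p>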
<p>However, this approach is only possible if gold targets <script type="math/tex">\bar{y}</script> are available. As mentioned in the first section, obtaining these might be too expensive in practice and weaker supervision signals are the practical alternative.</p>
<p>There is a further reason for a different objective, even when question-parse pairs are available:</p>
<p>There might be other parses, not just the annotated gold parse, that lead to the correct answer. But these can never be discovered if the MLE objective is used. Discovering further valid parses could stabilise learning and help generalisation. Further, this allows the parser to find suitable parses in its own output space.</p>
<p>Next, we turn to objectives which assume the existence of gold answers, obtained either by executing gold parses or because gold answers were annotated. For these objectives, a parser produces model outputs which are executed to obtain corresponding answers. The answers can be compared to the available gold answer and a reward can be assigned to the various model outputs.</p>
<h3 id="question-answer-pairs-reinforce-and-minimum-risk-training-mrt">Question-Answer Pairs: REINFORCE and Minimum Risk Training (MRT)</h3>
<p>Recently, there has been a surge of interest in applying reinforcement learning approaches, in particular the REINFORCE algorithm (<a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>), to (weakly) supervised NLP tasks. The inherent issues that arise in this context are also explored with regard to neural machine translation in <a href="https://statnlp.github.io/rl4nmt">another blog post</a>.</p>
<p>We will first introduce the REINFORCE algorithm, then discuss potential issues.</p>
<p>In REINFORCE, given an input question <script type="math/tex">x</script>, <strong>one</strong> output <script type="math/tex">y</script> is sampled from the current model distribution (see Section 13.3 of <a href="https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view">Sutton & Barto, 2018</a>). Executing this sampled parse to obtain an answer <script type="math/tex">a</script>, the comparison with the gold answer <script type="math/tex">\bar{a}</script> provides us with a reward <script type="math/tex">\delta</script>. On the basis of this single reward, the model’s parameters are updated, i.e. we can define the following objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{REINFORCE} = - \delta \pi_w(y\mid x).</script>
<p>However, with just one sample, this objective can suffer from high variance. This can be combated by introducing control variates, which lower variance. The most popular choice is using a baseline, where we keep track of the average reward, which is subtracted from <script type="math/tex">\delta</script>.</p>
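<p>The following toy sketch shows one way to implement REINFORCE with an average-reward baseline. It makes strong simplifying assumptions that are not part of the post: the “parser” is a softmax over three candidate parses, only the third candidate executes to the gold answer, and the update uses the standard score-function gradient of the log-probability of the sampled output. All names and hyperparameters are illustrative.</p>

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, baseline, rng, learning_rate=0.5):
    """Sample ONE output, observe its reward, and update the logits."""
    probs = softmax(logits)
    y = rng.choices(range(len(logits)), weights=probs)[0]
    delta = reward_fn(y)
    advantage = delta - baseline  # baseline acts as a control variate
    # Gradient of log pi(y) with respect to logit k is (1[k == y] - probs[k]).
    new_logits = [
        logit + learning_rate * advantage * ((1.0 if k == y else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
    return new_logits, delta


def train(num_steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    reward_sum = 0.0

    def reward_fn(y):
        # Assumption: only candidate 2 executes to the gold answer.
        return 1.0 if y == 2 else 0.0

    for step in range(num_steps):
        # Baseline: average reward observed so far (zero before any feedback).
        baseline = reward_sum / step if step else 0.0
        logits, delta = reinforce_step(logits, reward_fn, baseline, rng)
        reward_sum += delta
    return softmax(logits)


print(train())
```

<p>Each update relies on a single sampled output, which is why the estimate is noisy; subtracting the running average reward reduces this noise without changing the expected direction of the update.</p>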
<p>But why only sample one output?</p>
<p><strong>We have the luxury of having gold answers available.</strong></p>
<p>This allows us to sample several outputs and obtain rewards for all of them. With this, an average can be computed and on the basis of this average updates to <script type="math/tex">w</script> are performed. First, this lowers the variance. Second, it allows us to try out several model outputs, which helps us to explore the output space and in turn increases our chance of finding a parse that leads to the correct answer.</p>
<p>Building an average based on several outputs obtained for one input is exactly the characteristic idea of Minimum Risk Training (MRT).</p>
<p>MRT was introduced in the context of log-linear models for dependency parsing and machine translation (<a href="https://people.cs.umass.edu/~dasmith/dtrain_acl_2006.pdf">Smith & Eisner, 2006</a>). It has also been tested for neural models in the context of machine translation (<a href="http://anthology.aclweb.org/P/P16/P16-1159.pdf">Shen et al., 2016</a>).</p>
<p>Sampling <script type="math/tex">S</script> outputs per input, we can define the following MRT objective:</p>
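<script type="math/tex; mode=display">\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s\mid x),</script>
<p>where <script type="math/tex">\delta_s</script> is the reward obtained for the sampled output <script type="math/tex">y_s</script>.</p>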
<p>This objective is for example employed in <a href="http://www.aclweb.org/anthology/P17-1003">Liang et al., 2017</a>. Although they use the term REINFORCE (“We apply REINFORCE”), their later objective is based upon <script type="math/tex">S</script> outputs (“Thus, in contrast with common practice of approximating the gradient by sampling from the
model, we use the top-<script type="math/tex">k</script> action sequences”), which is reminiscent of MRT. Similarly, <a href="http://aclweb.org/anthology/P17-1097">Guu et al., 2017</a> also calculate an average over several output samples for one input (see their Equation 9). <a href="http://proceedings.mlr.press/v70/mou17a/mou17a.pdf">Mou et al., 2017</a> also take advantage of the gold answers to sample and evaluate several parses for one input (“We adjust the reward by subtracting the mean reward, averaged over sampled actions for a certain data point.”).</p>
<p>MRT is superior because by sampling several outputs per input, it exhibits lower variance than REINFORCE. But it can only be applied if question-answer pairs are available. Additionally, it is more expensive to compute.</p>
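<p>The MRT objective above can be sketched in a few lines of Python. The rewards and probabilities are made-up values standing in for executed parses and model outputs, and the function name is hypothetical. With a single sample, the same function reduces to the REINFORCE objective.</p>

```python
def mrt_loss(rewards, probs):
    """Average of reward-weighted model probabilities over S sampled outputs.

    rewards[s]: reward of the s-th sampled parse (compared with the gold answer)
    probs[s]:   model probability pi_w(y_s | x) of that parse
    """
    num_samples = len(rewards)
    return -sum(d * p for d, p in zip(rewards, probs)) / num_samples


# Three sampled parses for one question: correct, wrong, partially correct.
print(mrt_loss([1.0, 0.0, 0.5], [0.2, 0.3, 0.4]))

# With a single sample, this is the REINFORCE objective for that sample.
print(mrt_loss([1.0], [0.2]))
```

<p>The additional cost of MRT is visible here: every extra sample requires one more parse to be executed and scored against the gold answer.</p>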
<p>For our final scenario from the previous section, where neither gold answers nor gold parses are available and we only have feedback collected for one model output, we are limited to only one sample and MRT cannot be applied.</p>
<p>Let’s see which objectives we can apply in such scenarios.</p>
<h3 id="question-feedback-pairs-reinforce-and-counterfactual-learning">Question-Feedback Pairs: REINFORCE and Counterfactual Learning</h3>
<p>A setup where only one output and its corresponding feedback are available is also called a bandit learning scenario. The name is inspired by choosing one among several slot machines (colloquially referred to as “one-armed bandits”), where we only observe the reward for the chosen machine (i.e. output) and it remains unknown what reward the other machines (or outputs) would have obtained.</p>
<p>This is a crucial contrast to learning from question-answer pairs. We illustrate this graphically in the figure below. The left side shows the scenario where question-answer pairs are available, whereas the right assumes question-feedback pairs where no gold answers are available.</p>
<p><img src="/statnlpgroup/images/blog/2018-11-14_QA.png" alt="qa" class="align-left" /></p>
<p>REINFORCE is still applicable in bandit learning scenarios. But if we collect feedback as users are using the system, it can be dangerous to update the parser’s parameters online.</p>
<p>The parser’s performance could deteriorate without notice, which can lead to user dissatisfaction and monetary loss. It also makes it impossible to explore different hyperparameter settings.</p>
<p>Instead, it is safer to first collect the feedback in a log of triples <script type="math/tex">\mathcal{D}_{log}=\{(x_m,y_m,\delta_m)\}_{m=1}^M</script>. Once enough feedback has been collected, the log can be used to further improve the parser offline. The resulting model can then be validated against additional test sets before it is deployed.</p>
<p>However, once we start learning, the outputs recorded in the log might no longer be the outputs the updated parser would choose; i.e. the log we collected is biased towards the parser that was deployed at the time. Learning from such a log leads to a counterfactual, or off-policy, learning setup.</p>
<p>The bias in the log can be corrected using <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a>, where we divide the probability that the new model <script type="math/tex">\pi_w</script> assigns to the logged output by the probability that the deployed parser <script type="math/tex">\mu</script> assigned to that output. This leads to the following Inverse Propensity Score (IPS) objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{IPS} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \frac{\pi_w(y_m\mid x_m)}{\mu(y_m\mid x_m)}.</script>
<p>However, because we ideally want to present only correct parses and answers to our users, we want to always show the most likely output under the currently deployed model. This results in a deterministic log because the probability of choosing the most likely output is always one. Consequently, we can no longer correct the data bias.</p>
<p>This leads to the Deterministic Propensity Matching (DPM) objective:</p>
<script type="math/tex; mode=display">\mathcal{L}_{DPM} = - \frac{1}{M} \sum_{m=1}^{M} \delta_m \pi_w(y_m\mid x_m).</script>
<p>Just like REINFORCE, both IPS and DPM suffer from high variance because only one output received a reward for each input. It is thus advisable to employ control variates, e.g. one-step-late reweighting (<a href="http://aclweb.org/anthology/P18-1169">Lawrence & Riezler, 2018</a>).</p>
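<p>Both counterfactual objectives can be computed directly from a feedback log. The sketch below uses made-up log entries; the function names and data layout are hypothetical. It also illustrates that DPM coincides with IPS when the logging probability of every logged output is one, which is the deterministic logging case.</p>

```python
def ips_loss(deltas, pi_probs, mu_probs):
    """IPS objective over a log of M entries.

    deltas[m]:   feedback for the logged output y_m
    pi_probs[m]: probability of y_m under the model being trained
    mu_probs[m]: probability of y_m under the logging (deployed) parser
    """
    num_entries = len(deltas)
    total = sum(d * p / q for d, p, q in zip(deltas, pi_probs, mu_probs))
    return -total / num_entries


def dpm_loss(deltas, pi_probs):
    """DPM objective: no propensity correction, as the log is deterministic."""
    num_entries = len(deltas)
    return -sum(d * p for d, p in zip(deltas, pi_probs)) / num_entries


deltas = [1.0, 0.0, 1.0]
pi_probs = [0.6, 0.5, 0.2]

# Stochastic logging: rarely logged outputs are weighted up.
print(ips_loss(deltas, pi_probs, mu_probs=[0.5, 0.9, 0.1]))

# Deterministic logging: every logged output had probability one.
print(ips_loss(deltas, pi_probs, mu_probs=[1.0, 1.0, 1.0]))
print(dpm_loss(deltas, pi_probs))
```

<p>In this plain form, the DPM value can always be improved by raising the probability of any logged output with non-zero feedback, which is one reason why control variates such as reweighting matter in practice.</p>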
<p>With these objectives it is possible to learn from question-feedback pairs where feedback was collected from users for one system output. This approach is useful for scenarios where neither the collection of gold parses nor the collection of gold answers is feasible.</p>
<p>Furthermore, this approach can also be applied to other tasks, such as machine translation (<a href="http://www.aclweb.org/anthology/D/D17/D17-1272.pdf">Lawrence et al., 2017</a> ; <a href="https://arxiv.org/pdf/1804.05958.pdf">Kreutzer et al., 2018a</a> ; <a href="https://arxiv.org/pdf/1805.10627.pdf">Kreutzer et al., 2018b</a>).</p>
<h2 id="summary">Summary</h2>
<ul>
<li>Semantic parsers are important modules in virtual personal assistants and with an increasing number of domains in which these assistants are used, we need to find efficient and effective methods to train parsers on new domains.</li>
<li>In general, the stronger the learning signal, the better the result. For each new domain, we should estimate the time and cost of the different approaches, while keeping in mind that:
<strong>question-parse pairs > question-answer pairs > question-feedback pairs.</strong></li>
<li>If it is too expensive to obtain gold parses or gold answers, then using counterfactual learning from question-feedback pairs is a viable alternative.</li>
</ul>
<p><strong>Acknowledgment: Thanks to Julia Kreutzer and Stefan Riezler for their valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. In particular, feel free to let us know if you think there is an important paper that we should add to this overview! We appreciate your feedback! If you want to cite this blogpost, use this <a href="/statnlpgroup/bibtex/2019-01-14_parsing_overview.bibtex">bibfile</a>.</strong></p>
<p>Author: Carolin. Summary: How can we train semantic parsers if neither question-parse nor question-answer pairs can be collected?</p>
<h1>RL in NMT: The Good, the Bad and the Ugly</h1>
<p>2018-11-15T00:00:00+00:00, http://www.cl.uni-heidelberg.de/statnlpgroup/blog/rl-nmt</p>
<!--# RL in NMT: the Good, the Bad and the Ugly-->
<p><img src="/statnlpgroup/images/blog/horse_small.png" alt="image-right" class="align-right" /> Let me introduce you to three popular practices for using reinforcement learning (RL) in neural machine translation (NMT): <strong>the Good</strong>, combining it with good old maximum likelihood estimation (MLE), <strong>the Ugly</strong>, combining it with “hacks”, and <strong>the Bad</strong>, applying it with ignorance of more evolved techniques. Those three are helping NMT researchers on the hunt for BLEU scores.</p>
<p>Western movies aside, the aim of this blogpost is to take a critical look at the recent trend to include RL-inspired objectives in NMT training. We’ll start with a <a href="#the-basics">recap</a> of RL training in NMT, then dive right into an empirical study by <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>, which leads to a discussion of the following three questions:</p>
<ol>
<li>How do <a href="#nmt-as-an-rl-problem">NMT and RL</a> fit together?</li>
<li>Why do we even get any benefits from an <a href="#rl-to-the-rescue">RL objective in supervised learning</a>?</li>
<li>Where can we find the <a href="#beyond-supervised-learning">real challenges</a>?</li>
</ol>
<p><strong>tl;dr</strong> RL is a popular first-aid method to fix supervised NMT training, but maybe not the most suitable one. RL shines outside supervised learning; new challenges and opportunities are to be found there.</p>
<h2 id="the-basics">The Basics</h2>
<p><strong>Introducing RL to incorporate rewards.</strong></p>
<h3 class="no_toc">Maximum Likelihood Estimation</h3>
<p>Standard auto-regressive NMT models, parametrized by a neural network with parameters <script type="math/tex">\theta</script>, are trained with <strong>maximum likelihood estimation</strong> on parallel data <script type="math/tex">(x, y) \in \mathcal{D}</script> resulting in the popular cross-entropy objective:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \text{MLE} &= \sum_{(x,y) \in \mathcal{D}} \log p_{\theta}(y \mid x) \end{align} %]]></script>
<h3 class="no_toc">Expected Reward</h3>
<p>So how does RL come into play? The idea is to introduce rewards to encourage model outputs that would obtain a high reward, not only the perfect reference translation (=MLE). In practice, rewards can be simulated with e.g., sentence-level BLEU scores, to reinforce samples that – if evaluated – would obtain a high BLEU score. You might ask yourself, why is it even necessary? We’ll discuss that in <a href="#nmt-as-an-rl-problem">a bit</a>. Assuming the existence of such scalar rewards obtained from <script type="math/tex">R: \mathcal{Y} \to [0,1]</script> we can formulate an objective that aims to maximize the <strong>expected reward</strong> for all model outputs:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \right] \end{align} %]]></script>
<h3 class="no_toc">Policy Gradient</h3>
<p>In contrast to the MLE objective, the RL objective cannot be optimized by simply backpropagating through it: the reward is a non-differentiable function of the discrete model outputs, and the expectation ranges over far too many outputs to enumerate. Luckily, with the help of the <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/"><strong>log-derivative trick</strong></a>, we can reformulate the gradient for this objective, also referred to as the <strong>policy gradient</strong>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \nabla_{\theta} \text{RL} &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \nabla_{\theta} \log p_{\theta}(y \mid x)\right] \end{align} %]]></script>
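<p>For readers who want the intermediate steps, the identity follows from <script type="math/tex">\nabla_{\theta} \log p_{\theta} = \nabla_{\theta} p_{\theta} / p_{\theta}</script>. A short derivation, assuming that the reward <script type="math/tex">R(y)</script> does not depend on <script type="math/tex">\theta</script> and writing the expectation as a sum over the output space <script type="math/tex">\mathcal{Y}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \nabla_{\theta} \text{RL} &= \nabla_{\theta} \sum_{y \in \mathcal{Y}} p_{\theta}(y \mid x) R(y) \\ &= \sum_{y \in \mathcal{Y}} R(y) \nabla_{\theta} p_{\theta}(y \mid x) \\ &= \sum_{y \in \mathcal{Y}} p_{\theta}(y \mid x) R(y) \nabla_{\theta} \log p_{\theta}(y \mid x) \\ &= \mathbb{E}_{p_{\theta}(y \mid x)} \left[ R(y) \nabla_{\theta} \log p_{\theta}(y \mid x)\right] \end{align} %]]></script>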
<p>We can now empirically <strong>estimate</strong> the gradient with e.g. Monte Carlo sampling and train our model with stochastic gradient ascent. This solution was introduced in the famous <strong>REINFORCE</strong> algorithm by <a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>. REINFORCE proposes to estimate the gradient with one sample for each input:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \tilde{\nabla}_{\theta} \text{RL} &= R(\tilde{y}) \nabla_{\theta} \log p_{\theta}(\tilde{y} \mid x),& \tilde{y} \sim p_{\theta}(y \mid x) \end{align} %]]></script>
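<p>To make the estimator concrete, here is a minimal sketch on a toy problem. It assumes a policy that is just a softmax over three possible outputs, with the logits playing the role of <script type="math/tex">\theta</script>; the reward values and all function names are invented for illustration. Because the output space is tiny, the exact gradient can be computed by enumeration and compared against the average of many single-sample estimates.</p>

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits: List[float], rewards: List[float]) -> List[float]:
    """Gradient of the expected reward w.r.t. the logits, by full enumeration."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def reinforce_estimate(logits: List[float], rewards: List[float],
                       rng: random.Random) -> List[float]:
    """Single-sample estimate: reward of the sample times grad log-probability."""
    probs = softmax(logits)
    sampled = rng.choices(range(len(probs)), weights=probs)[0]
    # For a softmax policy, d log p(k) / d logit_j = 1[j == k] - p_j.
    return [rewards[sampled] * ((1.0 if j == sampled else 0.0) - p)
            for j, p in enumerate(probs)]


logits = [0.5, 0.0, -0.5]   # toy "model parameters"
rewards = [1.0, 0.2, 0.0]   # toy rewards for the three possible outputs
rng = random.Random(0)

num_samples = 20000
mean_estimate = [0.0] * len(logits)
for _ in range(num_samples):
    estimate = reinforce_estimate(logits, rewards, rng)
    mean_estimate = [m + g / num_samples for m, g in zip(mean_estimate, estimate)]

exact = exact_gradient(logits, rewards)
```

<p>Each individual estimate is noisy, but their average approaches the exact gradient, which is what makes stochastic gradient ascent on single samples workable.</p>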
<p>How does this bring us to RL? In RL, more precisely in policy search, <script type="math/tex">p_{\theta}</script> is a policy that predicts actions <script type="math/tex">y</script>. The policy chooses one action and then receives a reward for this action from the environment. Importantly, it is not possible to go back and try other actions instead and compare their rewards. In a genuine RL setup, we are limited to <strong>single-sample</strong> estimates.</p>
<h3 class="no_toc">Training</h3>
<p>The current practice in NMT is to approximate the policy gradient either by multinomial sampling from the softmax-normalized outputs of the NMT model or by beam search.
The two objectives, MLE and RL, are trained either sequentially (e.g., supervised pre-training before reinforced fine-tuning, or alternating batches) or simultaneously (e.g., by linear interpolation).</p>
<h2 id="discussing-recent-trends-of-rl-in-nmt">Discussing Recent Trends of RL in NMT</h2>
<p><strong>If we care about BLEU, RL alone won’t help.</strong></p>
<p>In the recent EMNLP paper “A Study of RL for NMT”, <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> observe that RL-inspired training objectives have been shown to improve NMT quality, but usually come with tricks and rather weak baselines. Their question is now: Combining various variants of these tricks with learning from monolingual data, does RL still shine as expected?</p>
<p>To spoil the suspense right away, the study finds that using RL leads to marginal improvements over well-tuned baselines, also in combination with MLE and monolingual data (<strong>the good</strong>). However, the largest portions of improvement come from leveraging additional monolingual data (old news) (<strong>the ugly</strong>). But the RL-inspired approaches evaluated here lack comparisons to more evolved techniques, and assume access to reference translations (<strong>the bad</strong>). Let’s take a closer look!</p>
<h3 class="no_toc">RL Tricks</h3>
<h4 class="no_toc">Variance Reduction</h4>
<p>The variance of the <a href="#policy-gradient">gradient estimator</a> <strong>can</strong> be a problem for optimization, i.e. slow down or hinder convergence. The paper investigates the following solutions:</p>
<ul>
<li>Average reward baseline: Instead of using the reward directly, subtract its empirical average from the reward obtained.</li>
<li>Learned baseline: Subtract a learned reward instead of the empirical average. The learned reward is the output of a regression model, e.g. another neural network.</li>
</ul>
<p>The baseline was actually already proposed in the original REINFORCE paper and can be interpreted as an additive control variate (<a href="https://www.elsevier.com/books/simulation/ross/978-0-12-415825-2">Ross 2013</a>).
Actor-critic (AC) approaches go a step further and replace the reward obtained from the environment with a reward given by a critic that is trained to imitate the original reward (applied to NMT by e.g. <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a>).</p>
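<p>The effect of a baseline can be checked on a toy problem. The sketch below is an illustration under assumptions, not the setup of the paper: the policy is a softmax over three outputs, the rewards are invented, and the exact expected reward stands in for the empirical average reward. Mean and variance of the single-sample estimator are computed exactly by enumerating all outputs.</p>

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_moments(logits: List[float], rewards: List[float],
                      baseline: float) -> Tuple[List[float], float]:
    """Exact mean and total variance of (R(y) - baseline) * grad log p(y)."""
    probs = softmax(logits)
    dim = len(probs)
    mean = [0.0] * dim
    second_moment = 0.0
    for k, p_k in enumerate(probs):
        # Gradient estimate if output k is the one that gets sampled.
        grad = [(rewards[k] - baseline) * ((1.0 if j == k else 0.0) - probs[j])
                for j in range(dim)]
        mean = [m + p_k * g for m, g in zip(mean, grad)]
        second_moment += p_k * sum(g * g for g in grad)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


logits = [0.0, 0.0, 0.0]
rewards = [0.9, 0.8, 0.7]   # all outputs are "quite good", as after pre-training
probs = softmax(logits)
average_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = estimator_moments(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(logits, rewards, baseline=average_reward)
```

<p>In this toy setting the expected gradient is the same with and without the baseline, while the variance drops by more than a factor of 100. How large the effect is in a real NMT system depends on the reward distribution.</p>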
<p>Despite the reported effectiveness in practice, <a href="http://jmlr.csail.mit.edu/papers/volume5/greensmith04a/greensmith04a.pdf">Greensmith et al. 2004</a> showed that both above solutions are suboptimal and that one can actually <strong>learn</strong> an optimal baseline with minimal variance.</p>
<p>One important aspect that has been completely neglected in the present study is that the number of samples used for the Monte Carlo gradient estimate has an essential influence on the variance of the gradient. If rewards are simulated anyway, e.g., from references using sentence-level BLEU, why not sample multiple times and average the gradients over this subset? This may sound familiar, since this is exactly what is done in <strong>minimum risk training</strong> (proposed for NMT by <a href="https://arxiv.org/abs/1512.02433">Shen et al. 2016</a>).</p>
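<p>A small simulation illustrates the point about sample size. It is a sketch with invented numbers: a policy with two possible outputs, controlled by a single parameter through a sigmoid, and fixed rewards of 1.0 and 0.2. It compares the spread of gradient estimates built from one sample with estimates averaged over eight samples.</p>

```python
import math
import random
import statistics


def single_estimate(theta: float, rng: random.Random) -> float:
    """One-sample policy gradient estimate for a two-output toy policy."""
    p = 1.0 / (1.0 + math.exp(-theta))       # probability of output 1
    y = 1 if rng.random() < p else 0
    reward = 1.0 if y == 1 else 0.2           # invented reward values
    return reward * (y - p)                   # d log p(y) / d theta = y - p


def averaged_estimate(theta: float, num_samples: int, rng: random.Random) -> float:
    """Average of several independent one-sample estimates."""
    return sum(single_estimate(theta, rng) for _ in range(num_samples)) / num_samples


rng = random.Random(0)
one = [averaged_estimate(0.0, 1, rng) for _ in range(2000)]
eight = [averaged_estimate(0.0, 8, rng) for _ in range(2000)]
var_one = statistics.pvariance(one)
var_eight = statistics.pvariance(eight)
```

<p>Both estimators target the same gradient (0.2 for these values), but averaging over more samples gives a noticeably smaller variance, at the price of more reward evaluations per input.</p>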
<p>In <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>’s empirical study, no beneficial effect was observed when using the learned baseline. This contradicts the experience in e.g., <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a> and <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>. The conclusion that reward baselines are not necessary from “the economic perspective” (<a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>) might be a bit hasty.</p>
<h4 class="no_toc">Reward Shaping</h4>
<p>If the reward is only obtained at the end of each sequence (here: translation), how does the model know where the errors are? The problem of <a href="https://scholarworks.umass.edu/dissertations/AAI8410337/">credit assignment</a> is often addressed by introducing methods for reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al. 1999</a>).
<a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> investigate the implementation by <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>: For each element of the output, the individual reward is the difference between the BLEU score for the partial output including and the BLEU score for the partial output excluding the element: <script type="math/tex">R(y_t) = R(y_{1:t}) - R(y_{1:t-1})</script>. Note that the BLEU scores are computed with respect to the full reference output. Once again the references are exploited to <strong>simulate</strong> the rewards.</p>
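<p>The difference-based shaping can be sketched in a few lines. To keep the example self-contained, the sketch does not compute BLEU: <code>prefix_score</code> is a hypothetical stand-in (clipped unigram matches divided by the reference length), and the sentences are invented. Any sentence-level metric that can score partial outputs against the full reference could be plugged in instead.</p>

```python
from collections import Counter
from typing import List


def prefix_score(prefix: List[str], reference: List[str]) -> float:
    """Stand-in for BLEU: clipped unigram matches divided by reference length."""
    if not reference:
        return 0.0
    ref_counts = Counter(reference)
    hyp_counts = Counter(prefix)
    matches = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return matches / len(reference)


def shaped_rewards(hypothesis: List[str], reference: List[str]) -> List[float]:
    """Per-token reward: score of the prefix with the token minus score without it."""
    rewards = []
    previous = 0.0  # score of the empty prefix
    for t in range(1, len(hypothesis) + 1):
        current = prefix_score(hypothesis[:t], reference)
        rewards.append(current - previous)
        previous = current
    return rewards


reference = "the cat sat on the mat".split()
hypothesis = "the cat lay on the the mat".split()
token_rewards = shaped_rewards(hypothesis, reference)
```

<p>The per-token rewards telescope: their sum equals the score of the complete hypothesis. Tokens that add nothing to the score (here the wrong word and the surplus repetition) receive a reward of zero.</p>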
<p>But does this even address the original problem of credit assignment? The problem arose because we had to wait for rewards from the environment until we completed a sequence of actions (in NMT: produced a complete translation). As soon as <strong>references</strong> are used, we are in principle not restricted to delayed rewards anymore. One could for example compare each word in the translation to the words in the reference translation and then come up with a token-based reward. Simple binary rewards were for example proposed in <a href="https://arxiv.org/abs/1806.07169">Petrushkov et al. 2018</a>.</p>
<p>As long as we simulate the rewards using references, we can cheat our way around the real problem.
When references are not available and you simply cannot compute BLEU scores for any arbitrary, partial translation – what would you do?</p>
<p>To this end, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a> adopt the advantage actor-critic (A2C) framework (<a href="http://proceedings.mlr.press/v48/mniha16.pdf">Mnih et al. 2016</a>). A critic network predicts the expected future reward for each element, although the reward from the environment (here: BLEU) is only obtained at the end of the sequence. Unfortunately, the latter study does not include a comparison to RL approaches without reward shaping.</p>
<p>The empirical gains from reward shaping reported in <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>’s study are vanishingly small, which leaves the question of the usefulness of this method unanswered.</p>
<h3 class="no_toc">Using Monolingual Data</h3>
<h4 class="no_toc">Target-side</h4>
<p>Leveraging monolingual data for improving MT systems has become increasingly popular, since simple methods have been shown to be very effective for NMT. When target-side monolingual data is available, the trick of the trade is to use back-translation as demonstrated in <a href="https://arxiv.org/abs/1511.06709">Sennrich et al. 2016</a>. The only burden here is that one has to train a system in the opposite translation direction. This system can then generate <strong>pseudo-sources</strong> for the available target data. The <a href="https://arxiv.org/abs/1806.04402">“hallucinated”</a> parallel data can then be used for standard training, with simulated rewards or without.</p>
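<p>The data flow of back-translation can be sketched as follows. Everything here is a toy assumption: the reverse model is a hypothetical word-by-word dictionary rather than a trained NMT system, and the sentences are invented. The point is only that the targets stay genuine while the sources are generated.</p>

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy English-to-German "reverse model": a word-by-word dictionary.
TOY_DICTIONARY: Dict[str, str] = {
    "the": "das", "house": "haus", "is": "ist", "small": "klein",
}


def toy_reverse_model(target_sentence: str) -> str:
    """Translate word by word; unknown words are copied, so output can be noisy."""
    return " ".join(TOY_DICTIONARY.get(word, word) for word in target_sentence.split())


def back_translate(targets: List[str],
                   reverse_model: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pair each genuine target sentence with a generated pseudo-source."""
    return [(reverse_model(target), target) for target in targets]


monolingual_targets = ["the house is small", "the house is old"]
synthetic_pairs = back_translate(monolingual_targets, toy_reverse_model)
```

<p>The second pair has a flawed pseudo-source because the toy model does not know the word “old”, yet its target side is still a correct sentence. The resulting pairs would simply be added to the parallel training data.</p>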
<p>But isn’t it problematic to feed the NMT with fake data? Apparently not, at least as long as the targets are intact. <a href="https://arxiv.org/pdf/1808.09381.pdf">Edunov et al. 2018a</a> investigate this question systematically and surprisingly find that models get even better when the pseudo-sources are of low quality (not for small data, though). They hypothesize that the noise introduced actually enriches the training data and helps learning as e.g., in <a href="https://dl.acm.org/citation.cfm?id=1390294">denoising auto-encoders</a>.</p>
<h4 class="no_toc">Source-side</h4>
<p><a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> propose to leverage not only target-side monolingual data, but also source-side monolingual data.
Evoking techniques developed in the context of <a href="http://ruder.io/semi-supervised/index.html#selftraining">self-training</a>, the idea is to let the model generate <strong>pseudo-targets</strong> for its own training. We have to assume that it is able to generate targets that are “good enough”, in the sense that the model can bootstrap itself. In practice, this is addressed by using beam search decoding for generating translations that are likely to have higher quality than sampled or greedy decoded targets.</p>
<p>Does the quality of the pseudo-targets matter? When they are part of the RL objective, they are only used to simulate rewards for sampled translations, which perhaps can absorb some of the noise. In supervised MLE training <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a> add them to the much larger original parallel data – the small amount of extra noise might be negligible. However, this has not been investigated systematically.</p>
<h2 id="nmt-as-an-rl-problem">NMT as an RL problem</h2>
<p><strong>We only (mis-)use a subset of RL methods in NMT.</strong></p>
<p>The “Study of RL in NMT” is limited to a very specific scenario where policy gradient is used for fine-tuning of well-trained models. What about other RL algorithms? RL researchers have in fact dealt with reinforced objectives as above for decades and have developed more sophisticated training algorithms (such as <a href="https://arxiv.org/abs/1502.05477">Trust Region Policy Optimization</a> and <a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization</a>) than vanilla policy gradient. But that’s to be discussed in another blog post.
Nevertheless, so far only policy gradient and actor-critic have become really popular for structured prediction tasks. So what’s wrong, are we just slow in adopting their algorithms?</p>
<p>In fact, it is not trivial to cast NMT, or more generally structured prediction, as a standard (PO)MDP problem which is the basis for most RL algorithms: What is the environment? What is the state? Where does the reward come from? Translation researchers don’t agree on it (comparing e.g. definitions in <a href="https://arxiv.org/abs/1808.08866">Wu et al. 2018</a>, <a href="https://arxiv.org/abs/1707.07402">Nguyen et al. 2017</a>, <a href="https://arxiv.org/abs/1607.07086">Bahdanau et al. 2017</a>). It is in fact often more suitable to cast it as a simpler contextual bandit problem, aka <strong>bandit structured prediction</strong> (e.g., <a href="http://papers.nips.cc/paper/6133-stochastic-structured-prediction-under-bandit-feedback">Sokolov et al. 2016</a>, <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>, <a href="https://openreview.net/forum?id=HJNMYceCW">Daumé III et al. 2018</a>), as Hal Daumé III discussed in <a href="https://nlpers.blogspot.com/2017/04/structured-prediction-is-not-rl.html">his blogpost on structured prediction and RL</a> – you may see it as a one-state MDP.</p>
<p>What we can agree on is that in NMT we’re dealing with large and structured action spaces, where actions are discrete and rewards are sparse (and most of the time delayed) and potentially noisy. This calls for algorithms that are particularly suited to such settings, but neither REINFORCE nor AC address these issues in particular.</p>
<p>In fact, training NMT from scratch with pure RL objectives, i.e. <a href="https://arxiv.org/abs/1709.09346"><strong>cold-start</strong> RL</a>, has so far not succeeded (despite <a href="https://arxiv.org/abs/1611.00179">Xia et al. 2016</a>’s optimism).</p>
<h2 id="rl-to-the-rescue">RL to the rescue?</h2>
<p><strong>RL can improve NMT because it fixes problems of our standard objective.</strong></p>
<p>What’s wrong with MLE training for NMT? <a href="https://arxiv.org/abs/1511.06732">Ranzato et al. 2016</a> elaborated on this when proposing the MIXER algorithm that mixes policy gradient-style updates with MLE. They identify the following problems:</p>
<ul>
<li>Exposure bias: During training reference targets are fed to the model (=teacher forcing), while during inference the model has to produce outputs based on its own previous outputs.</li>
<li>Token-level objective (aka “loss-evaluation mismatch” in <a href="https://arxiv.org/pdf/1606.02960.pdf">Wiseman and Rush 2016</a>): In standard autoregressive NMT models, the sequence-level <a href="#maximum-likelihood-estimation">log-likelihood</a> is decomposed as a sum over token-level log-likelihoods. Training is hence optimized to find the next perfect output token given the previous perfect tokens. During inference, however, we’re measuring the model’s quality with metrics like BLEU that evaluate whole sequences of outputs.</li>
</ul>
<p>Algorithms like scheduled sampling (<a href="https://arxiv.org/abs/1506.03099">Bengio et al. 2015</a>), DAgger (<a href="https://arxiv.org/abs/1011.0686">Ross et al. 2011</a>) and DAD (<a href="https://www.ri.cmu.edu/pub_files/2015/1/Venkatraman.pdf">Venkatraman et al. 2015</a>) have been designed to reduce the exposure bias by gradually exposing the model to its own outputs during training (<strong>imitation learning</strong>).</p>
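<p>The core idea of scheduled sampling can be sketched as a choice of decoder inputs. This is a simplified illustration, not the implementation from the cited papers: the “model” is a hypothetical fixed next-token table, <code>BOS</code> is an assumed start marker, and <code>gold_prob</code> is the probability of feeding the gold token at a given step.</p>

```python
import random
from typing import Callable, List

BOS = "BOS"  # hypothetical start-of-sequence marker


def decoder_inputs(gold: List[str],
                   predict_next: Callable[[str], str],
                   gold_prob: float,
                   rng: random.Random) -> List[str]:
    """Choose what the decoder is fed at each step during training.

    With probability gold_prob the previous gold token is fed (teacher
    forcing); otherwise the model's own previous prediction is fed.
    """
    inputs = [BOS]
    for t in range(len(gold) - 1):
        model_token = predict_next(inputs[-1])
        use_gold = rng.random() < gold_prob
        inputs.append(gold[t] if use_gold else model_token)
    return inputs


# Toy stand-in for a trained model: a fixed next-token table.
TOY_MODEL = {"BOS": "a", "a": "b", "b": "c"}


def toy_predict(previous: str) -> str:
    return TOY_MODEL.get(previous, "a")


gold = ["x", "y", "z"]
teacher_forced = decoder_inputs(gold, toy_predict, gold_prob=1.0, rng=random.Random(0))
free_running = decoder_inputs(gold, toy_predict, gold_prob=0.0, rng=random.Random(0))
```

<p>With <code>gold_prob=1.0</code> this reduces to teacher forcing, with <code>gold_prob=0.0</code> the model only sees its own predictions, as at inference time. Lowering the probability gradually over the course of training moves from one regime to the other.</p>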
<p>The same effect is obtained when including some policy gradient in the training objective (e.g. in MIXER, MRT), since the gradient update is based on the log-likelihood of the model’s own output. It is directly optimized towards a sentence-level reward that is closer to the corpus BLEU we’re evaluating our models with. Furthermore, it can help with other non-differentiable objectives than the <a href="#expected-reward">expected reward</a>, e.g., for adversarial training (<a href="https://arxiv.org/pdf/1704.06933.pdf">Wu et al. 2017</a>, <a href="https://arxiv.org/abs/1703.04887">Yang et al. 2017</a>). Or you might just use it to teach the NMT system what you actually want from it (beyond generating translations close to the reference), e.g., copying certain words of the input (<a href="https://arxiv.org/pdf/1809.03182.pdf">Pham et al. 2018</a>).</p>
<p>Large gains using RL have been reported under domain shift, i.e., gains over baseline models that are not fine-tuned on the evaluation domain (e.g., <a href="https://arxiv.org/abs/1704.06497">Kreutzer et al. 2017</a>, <a href="https://arxiv.org/abs/1806.07169">Petrushkov et al. 2018</a>) or when combined with classic objectives (e.g., <a href="https://arxiv.org/abs/1609.08144">Wu et al. 2016</a>, <a href="https://arxiv.org/abs/1511.06732">Ranzato et al. 2016</a>). The paper discussed above demonstrates that without these factors, the expected improvements vanish.</p>
<p>Most commonly, RL is exploited as a first aid for obvious MLE problems, in a fully-supervised setting where references are available and rewards are <strong>simulated</strong>.
Why not use (or at least compare against) other training strategies that may be better suited for NMT and fix the above problems equally, as proposed e.g., in <a href="https://arxiv.org/abs/1711.04956">Edunov et al. 2018b</a>, <a href="https://arxiv.org/abs/1512.02433">Shen et al. 2016</a> and <a href="https://arxiv.org/abs/1609.00150">Norouzi et al. 2016</a>?</p>
<h2 id="beyond-supervised-learning">Beyond supervised learning</h2>
<p><strong>The challenges in RL for NLP lie outside supervised learning.</strong></p>
<p>So what about more realistic uses of RL, e.g., where rewards cannot simply be simulated, or reward signals are not given as well-defined functions, or not available in unlimited amounts? In NLP, the following scenarios are evident:</p>
<ul>
<li>Gold standard structures may not be available because of the <strong>cost or the lack of expertise</strong> of human annotators. Weaker signals such as human judgments on the quality of output structures may be easier to obtain and may require less expertise. This is the case for example in semantic parsing (<a href="https://arxiv.org/abs/1805.01252">Lawrence et al. 2018</a>) or in machine translation (<a href="https://arxiv.org/abs/1805.10627">Kreutzer et al. 2018b</a>).</li>
<li>In genuinely <strong>interactive</strong> settings where a system directly interacts with a human, the human responses can be interpreted as a weak signal for how to further improve the system. A prime example is dialogue, where learning from human feedback has successfully been implemented to train systems e.g., for small-talk (<a href="https://arxiv.org/abs/1709.02349">Serban et al. 2017</a>) and task-oriented dialogue (<a href="https://arxiv.org/abs/1606.02689">Su et al. 2016</a>).</li>
<li>Systems that need to be heavily <strong>customized</strong> towards a user or domain. User preferences or ratings (that usually come for free) can be used to specifically adapt the system. In industrial settings, large-scale collections of feedback have been utilised in personalized news recommendation (<a href="https://dl.acm.org/citation.cfm?doid=1772690.1772758">Li et al. 2010</a>) or e-commerce translation systems (<a href="https://arxiv.org/abs/1804.05958">Kreutzer et al. 2018a</a>).</li>
</ul>
<p>These scenarios bring challenges that can only partly be addressed by simulations and arise from the interaction with humans in real-life scenarios.
The human factor entails several differences to the popular simulation scenarios of RL. Firstly, human rewards are not well-defined functions, but complex and inconsistent signals. Secondly, humans cannot be expected to provide feedback for unlimited amounts of outputs. Exciting challenges (<a href="https://www.alexirpan.com/2018/02/14/rl-hard.html">“RL is hard”</a>) like the collection of reliable feedback, building robustness against adversarial feedback, fair evaluation, and off-policy learning, are ahead of us!</p>
<p>So instead of asking the question “How to get high BLEU with RL-objectives?” let’s move to “How to learn from rewards with RL when we depend on them?”.</p>
<p><strong>Acknowledgment: Thanks to Carolin Lawrence, Stefan Riezler and Joost Bastings for their valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this <a href="/statnlpgroup/bibtex/rl-nmt.bibtex">bibfile</a>.</strong></p>
<p>Author: Julia. Summary: Discussing good, bad and ugly practices of reinforcement learning in neural machine translation.</p>
<h1>Taming Wild Reward Functions: The Score Function Gradient Estimator Trick</h1>
<p>2018-11-12T00:00:00+00:00, http://www.cl.uni-heidelberg.de/statnlpgroup/blog/score-function</p>
<!--# Taming Wild Reward Functions: The Score Function Gradient Estimator Trick-->
<p>MLE is often not enough to train sequence-to-sequence neural networks in NLP. Instead we employ an external metric, which is a reward function that can help us judge model outputs. The parameters of the network are then updated on the basis of the model outputs and corresponding rewards.</p>
<p>For this update, it is necessary to obtain a derivative.</p>
<p>But how can we do this if the external function is unknown or not differentiable?</p>
<p><strong>Enter:</strong> The score function gradient estimator trick.</p>
<p><img src="/statnlpgroup/images/blog/luke.jpg" alt="" /></p>
<h2 id="why-mle-is-not-enough">Why MLE is not Enough</h2>
<p>Traditionally, neural networks are trained using Maximum Likelihood Estimation (MLE): given an input sequence <script type="math/tex">x = x_1, x_2, \dots x_{ \mid x \mid }</script> and a corresponding gold target sequence <script type="math/tex">\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{ \mid \bar{y} \mid }</script>, we want to increase the probability that the current model <script type="math/tex">\pi</script> with parameters <script type="math/tex">w</script> assigns to the pair <script type="math/tex">(x,\bar{y})</script>. This gives the following loss function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}_{MLE} = - \sum_{j=1}^{ \mid \bar{y} \mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x), %]]></script>
<p>where <script type="math/tex">% <![CDATA[
\bar{y}_{<j} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{j-1}. %]]></script></p>
<p>The parameters <script type="math/tex">w</script> of <script type="math/tex">\pi</script> are then updated using stochastic gradient descent,</p>
<script type="math/tex; mode=display">w = w - \eta \nabla_w \mathcal{L}_{MLE}.</script>
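<p>A minimal sketch of the loss and the update rule, under simplifying assumptions: the probabilities the model assigns to the gold tokens are given as plain numbers, and for the update step the logits of a single output position are treated directly as the parameters <script type="math/tex">w</script>. All values and function names are invented for illustration.</p>

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_nll(gold_token_probs: List[float]) -> float:
    """MLE loss for one sequence: minus the sum of gold-token log-probabilities."""
    return -sum(math.log(p) for p in gold_token_probs)


def sgd_step(logits: List[float], gold_index: int, learning_rate: float) -> List[float]:
    """One update w = w - eta * gradient for a single output position.

    For a softmax over the vocabulary, the gradient of -log pi_w(gold)
    with respect to the logits is (probabilities - one_hot(gold)).
    """
    probs = softmax(logits)
    grad = [p - (1.0 if j == gold_index else 0.0) for j, p in enumerate(probs)]
    return [w - learning_rate * g for w, g in zip(logits, grad)]


# Loss of a three-token gold sequence with invented model probabilities.
loss = sequence_nll([0.5, 0.25, 0.5])

# One SGD step on a toy three-word vocabulary where word 0 is the gold token.
logits = [0.0, 0.0, 0.0]
before = -math.log(softmax(logits)[0])
after = -math.log(softmax(sgd_step(logits, gold_index=0, learning_rate=0.5))[0])
```

<p>After the step the gold token is more probable than before, so the loss at that position decreases.</p>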
<p>But there are various issues with using MLE that have led researchers to explore alternative objectives. Let’s look at them next.</p>
<h3 id="1-gold-targets-bary-are-not-available">1. Gold targets <script type="math/tex">\bar{y}</script> are not Available</h3>
<p>This is most prominently the case in many domains of semantic parsing for question-answering, where questions <script type="math/tex">x</script> are mapped to a semantic parse <script type="math/tex">y</script>, which can be executed to obtain an answer <script type="math/tex">a</script> . For many domains, it is easier to collect question-answer pairs, rather than question-parse pairs (e.g. see <a href="http://www.aclweb.org/anthology/D/D13/D13-1160.pdf">Berant et al. 2013</a>). But with no gold parses available, MLE cannot be applied.</p>
<p>What can we do instead?</p>
<p>The current model produces a set of likely parses (e.g. by sampling from the model distribution or by employing beam search). Each parse is then executed to obtain an answer. Next, we compare the answer to the gold answer to get a reward <script type="math/tex">\delta</script> . Generally, we have <script type="math/tex">\delta=0</script> if there is no overlap between answer and gold answer and <script type="math/tex">\delta=1</script> if they match exactly. With this, we can update the model’s parameters.</p>
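<p>The loop described above can be sketched as follows. The setup is a toy assumption: parses are just keys into a hypothetical in-memory database, and partial overlap between answer sets is scored with F1, which is one possible choice that yields 0 for no overlap and 1 for an exact match.</p>

```python
from typing import Dict, List, Set

# Hypothetical toy "database": each parse is a key that executes to a set of answers.
TOY_DATABASE: Dict[str, Set[str]] = {
    "query(cafes, heidelberg)": {"Cafe A", "Cafe B"},
    "query(cafes, old_town)": {"Cafe A"},
    "query(hotels, heidelberg)": {"Hotel C"},
}


def execute(parse: str) -> Set[str]:
    """Execute a parse; parses that are not valid queries return no answer."""
    return TOY_DATABASE.get(parse, set())


def answer_reward(answer: Set[str], gold_answer: Set[str]) -> float:
    """F1 overlap between the returned and the gold answer set (assumed metric)."""
    overlap = len(answer.intersection(gold_answer))
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer)
    recall = overlap / len(gold_answer)
    return 2 * precision * recall / (precision + recall)


gold_answer = {"Cafe A", "Cafe B"}
candidate_parses: List[str] = list(TOY_DATABASE)  # stands in for sampled parses
rewards = [answer_reward(execute(parse), gold_answer) for parse in candidate_parses]
```

<p>Each candidate parse now carries a reward, even though no gold parse was ever needed: only the gold answer is used.</p>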
<h3 id="2-exposure-bias-ranzato-et-al-2016">2. Exposure Bias: <a href="https://arxiv.org/pdf/1511.06732.pdf">Ranzato et al. 2016</a></h3>
<p>During traditional MLE training the model is fed the perfect tokens from the available gold target <script type="math/tex">\bar{y}</script> , but at test time the output sequence is produced on the basis of the model distribution. This causes a distribution mismatch and inferior performance.</p>
<p>How can we reduce this mismatch?</p>
<p>We can feed the model its own output sequences already at training time. Typically, once an entire output sequence has been produced, this sequence is judged by an external metric and the resulting reward <script type="math/tex">\delta</script> can be used as feedback to update the model’s parameters.</p>
<h3 id="3-loss-evaluation-mismatch-wiseman--rush-2016">3. Loss-Evaluation Mismatch: <a href="http://www.aclweb.org/anthology/D16-1137">Wiseman & Rush 2016</a></h3>
<p>MLE is agnostic to the final evaluation metric. Ideally we would like to have the final evaluation metric in the objective used at training time, so that the parameters of the model are specifically tuned to perform well on the intended task.</p>
<p>How can we do that?</p>
<p>Similar to problem (2.), we can feed model output sequences at training time. In this case the external metric is the final evaluation metric. For example, in the case of machine translation, typically a per-sentence approximation of the BLEU score is used.</p>
<h2 id="maximise-the-expected-reward-obtained-for-model-outputs">Maximise the Expected Reward Obtained for Model Outputs</h2>
<p>To solve all three problems, we can instead maximise the expected reward <script type="math/tex">\delta</script> or, equivalently, minimise the expected risk <script type="math/tex">-\delta</script> . This can be formulated as the following expectation:</p>
<script type="math/tex; mode=display">\mathcal{L}_\delta = \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta],</script>
<p>where <script type="math/tex">p(x)</script> is the probability distribution over inputs <script type="math/tex">x</script> and <script type="math/tex">\pi_w(y \mid x)</script> is the probability distribution over outputs <script type="math/tex">y</script> given <script type="math/tex">x</script> .</p>
<p>In practice, this expectation has to be approximated. For example, using Monte-Carlo sampling leads to the REINFORCE algorithm (<a href="http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf">Williams 1992</a>): we sample one output <script type="math/tex">y</script> from the model distribution <script type="math/tex">\pi_w(y \mid x)</script> (see also Chapter 13 of <a href="https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view">Sutton & Barto 2018</a>).
Approximating the expectation over <script type="math/tex">y</script>, the actual training objective becomes:</p>
<script type="math/tex; mode=display">\mathcal{L}_{REINFORCE} = - \delta \pi_w(y \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].</script>
<p>The goal of this objective is to increase the probability of an output proportionally to its reward. The gradient of the REINFORCE objective is an unbiased estimate of the gradient of the <script type="math/tex">\mathcal{L}_\delta</script> objective.</p>
<p>Alternatively, we can use Minimum Risk Training (MRT) (<a href="http://aclweb.org/anthology/P06-2101">Smith & Eisner ‘06</a>, <a href="http://anthology.aclweb.org/P/P16/P16-1159.pdf">Shen et al. 2016</a>). Here, several outputs are sampled from the model distribution. This stabilises learning, but requires evaluating more outputs to obtain their rewards. Assuming <script type="math/tex">S</script> sampled outputs, the objective then takes the following form:</p>
<script type="math/tex; mode=display">\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \pi_w(y_s \mid x) \approx \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta].</script>
<p>Because they rely on sampling, both approaches can suffer from high variance, which can be reduced with control variates (see for example Chapter 9 of <a href="https://www.elsevier.com/books/simulation/ross/978-0-12-415825-2">Ross 2013</a>).</p>
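<p>To illustrate the effect of the number of samples and of a control variate, the sketch below estimates the gradient from <script type="math/tex">S</script> sampled outputs and measures the empirical variance of one gradient component. It rests on several assumptions that are not taken from the post: a toy softmax policy, hypothetical rewards, the per-sample gradient <script type="math/tex">-(\delta - b) \nabla_w \log \pi_w(y \mid x)</script>, and a constant baseline <script type="math/tex">b</script> as the control variate. Subtracting a constant leaves the expected gradient unchanged because <script type="math/tex">\mathbb{E}_{\pi_w(y \mid x)}[\nabla_w \log \pi_w(y \mid x)] = 0</script>.</p>

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_estimate(w, rewards, num_samples, baseline, rng):
    """Average the per-sample gradient -(delta - baseline) * grad log pi over S samples."""
    probs = softmax(w)
    grad = [0.0] * len(w)
    for _ in range(num_samples):
        y = rng.choices(range(len(w)), weights=probs)[0]
        scale = -(rewards[y] - baseline)
        for k in range(len(w)):
            indicator = 1.0 if k == y else 0.0
            grad[k] += scale * (indicator - probs[k]) / num_samples
    return grad


def variance_of_first_component(baseline, num_samples=5, trials=2000, seed=0):
    """Empirical variance of the first gradient component over repeated estimates."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    rewards = [0.8, 0.9, 1.0]  # hypothetical, all positive and close together
    values = [
        grad_estimate(w, rewards, num_samples, baseline, rng)[0]
        for _ in range(trials)
    ]
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials


if __name__ == "__main__":
    print("S=1, no baseline:  ", variance_of_first_component(0.0, num_samples=1))
    print("S=20, no baseline: ", variance_of_first_component(0.0, num_samples=20))
    print("S=5, baseline 0.9: ", variance_of_first_component(0.9, num_samples=5))
```

<p>In this toy setting, more samples and a baseline close to the average reward should both lower the variance. How to choose the baseline in a real system is a separate design decision that this sketch does not address.</p>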
<h2 id="the-problem-the-reward-function-cannot-be-derived">The Problem: The Reward Function Cannot Be Differentiated</h2>
<p>To minimise <script type="math/tex">\mathcal{L}_{\delta}</script> with stochastic gradient descent, it is necessary to calculate <script type="math/tex">\nabla_w \mathcal{L}_{\delta}</script> , also called the policy gradient in Reinforcement Learning (RL) terms.</p>
<p>In practice, however, the rewards <script type="math/tex">\delta</script> typically either come from an unknown function (e.g. if rewards are collected from human users) or the underlying function is not differentiable (e.g. in the case of BLEU).</p>
<p>As such, it is not immediately clear how to differentiate <script type="math/tex">\mathcal{L}_{\delta}</script> , i.e. how to calculate
<script type="math/tex">\nabla_w\mathcal{L}_{\delta}.</script></p>
<h2 id="the-solution-score-function-gradient-estimator">The Solution: Score Function Gradient Estimator</h2>
<p>To be able to calculate <script type="math/tex">\nabla_w\mathcal{L}_{\delta}</script> , we use two tricks:</p>
<h3 id="1-the-log-derivative-trick">1. The <script type="math/tex">\log</script> Derivative Trick</h3>
<p>The derivative of the logarithm is:</p>
<script type="math/tex; mode=display">\nabla_w \log f = \frac{\nabla_w f}{f}.</script>
<h3 id="2-the-identity-trick">2. The Identity Trick</h3>
<script type="math/tex; mode=display">f = \frac{g}{g} f</script>
<p>Now we can formulate what is known as the score function gradient estimator (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0927050706130194">Fu ‘06</a>):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align} \nabla_w \mathcal{L}_\delta &= \nabla_w \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [- \delta] & (1) \\
&= \nabla_w \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (2) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w\pi_w(y \mid x)\textrm{d}y & (3) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \frac{\nabla_w \pi_w(y \mid x)}{\pi_w(y \mid x)} \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (4) \\
&= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w \log \pi_w(y \mid x) \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (5) \\
&= \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta \nabla_w \log \pi_w(y \mid x)]. & (6) \end{align} %]]></script>
<p>Let’s investigate for each line what happened:</p>
<ul>
<li>(2): The expectation is expanded into two integrals. <script type="math/tex">\mathbb{E}_{p(x)}</script> becomes <script type="math/tex">\int_{x} \dots p(x)\textrm{d}x</script> and <script type="math/tex">\mathbb{E}_{\pi_w(y \mid x)}</script> turns into <script type="math/tex">\int_{y} \dots \pi_w(y \mid x)\textrm{d}y</script> .</li>
<li>(3): Assuming that integration and differentiation can be interchanged, we move <script type="math/tex">\nabla_w</script> in front of <script type="math/tex">\pi_w(y \mid x)</script> because <script type="math/tex">\pi_w(y \mid x)</script> is the only term that depends on <script type="math/tex">w</script> .</li>
<li>(4): We use the identity trick with <script type="math/tex">g = \pi_w(y \mid x)</script> .</li>
<li>(5): We use the <script type="math/tex">\log</script> derivative trick.</li>
<li>(6): We still have <script type="math/tex">\pi_w(y \mid x)\textrm{d}y</script> available. With this, we can transform the expression back into an expectation. But in contrast to before, we now have <script type="math/tex">\nabla_w \log \pi_w(y \mid x)</script> and this derivative is simply scaled by <script type="math/tex">\delta</script> .</li>
</ul>
<p><strong><script type="math/tex">\rightarrow</script> We no longer need to know what the function that produces <script type="math/tex">\delta</script> looks like, nor do we need to differentiate it.</strong></p>
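<p>The result in line (6) can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the policy is a softmax over three logits, <code>black_box_reward</code> is a hypothetical stand-in for a metric such as BLEU or human feedback that we can only query, and the reference gradient is obtained by finite differences on the exact expectation, which is feasible only because the output space has three elements.</p>

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def black_box_reward(y):
    """Hypothetical reward lookup: it can be queried but not differentiated."""
    return [0.2, 1.0, 0.5][y]


def expected_loss(w):
    """Exact expectation of -delta under the policy (enumerates all outputs)."""
    return -sum(p * black_box_reward(y) for y, p in enumerate(softmax(w)))


def finite_difference_gradient(w, eps=1e-6):
    """Reference gradient of the exact expected loss via central differences."""
    grad = []
    for k in range(len(w)):
        up = list(w)
        up[k] += eps
        down = list(w)
        down[k] -= eps
        grad.append((expected_loss(up) - expected_loss(down)) / (2 * eps))
    return grad


def score_function_gradient(w, num_samples, rng):
    """Monte-Carlo average of -delta * grad log pi(y), as in line (6)."""
    probs = softmax(w)
    grad = [0.0] * len(w)
    for _ in range(num_samples):
        y = rng.choices(range(len(w)), weights=probs)[0]
        delta = black_box_reward(y)  # only the reward value is needed
        for k in range(len(w)):
            indicator = 1.0 if k == y else 0.0
            grad[k] += -delta * (indicator - probs[k]) / num_samples
    return grad


W = [0.3, -0.2, 0.1]  # arbitrary toy parameters

if __name__ == "__main__":
    print("finite differences:", finite_difference_gradient(W))
    print("score function:    ", score_function_gradient(W, 100000, random.Random(0)))
```

<p>The two gradients should agree up to sampling noise, even though the estimator never looks inside the reward function.</p>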
<p>For an alternative view on the subject, also see <a href="http://blog.shakirm.com/2015/11/machine-learning-trick-of-the-day-5-log-derivative-trick/">this great blog post</a>.</p>
<h2 id="when-can-it-be-applied">When can it be applied?</h2>
<p>The score function gradient estimator can be applied independently of the underlying model, as long as the model is differentiable with respect to its parameters <script type="math/tex">w</script> .</p>
<p>E.g. if <script type="math/tex">\pi_w(y \mid x)</script> is a log-linear model with feature vectors <script type="math/tex">\phi(x,y)</script> ,</p>
<script type="math/tex; mode=display">\pi_w(y \mid x) = \frac{e^{ w \phi(x,y)}}{\sum_{y'\in \mathbf{Y}(x)} e^{ w \phi(x, y')}},</script>
<p>then the derivative would be</p>
<script type="math/tex; mode=display">\nabla_w \log \pi_w(y \mid x) = \phi(x,y) - \sum_{y'\in \mathbf{Y}(x)} \phi(x, y')\pi_w(y' \mid x).</script>
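<p>As a small check of this formula, the following sketch computes the gradient of the log-probability for a log-linear model and compares it with a finite-difference approximation. The feature vectors, the weights and the function names are made up for illustration; the candidate set <script type="math/tex">\mathbf{Y}(x)</script> is assumed to be small enough to enumerate.</p>

```python
import math


def log_linear_probs(w, features):
    """pi_w(y | x) for every candidate y, given one feature vector phi(x, y) each."""
    scores = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(w, features, y):
    """phi(x, y) minus the expected feature vector under the model."""
    probs = log_linear_probs(w, features)
    dims = range(len(w))
    expected_phi = [sum(p * phi[d] for p, phi in zip(probs, features)) for d in dims]
    return [features[y][d] - expected_phi[d] for d in dims]


def numeric_grad_log_prob(w, features, y, eps=1e-6):
    """Central finite differences on log pi_w(y | x), used as a sanity check."""
    grad = []
    for d in range(len(w)):
        up = list(w)
        up[d] += eps
        down = list(w)
        down[d] -= eps
        log_up = math.log(log_linear_probs(up, features)[y])
        log_down = math.log(log_linear_probs(down, features)[y])
        grad.append((log_up - log_down) / (2 * eps))
    return grad


# Hypothetical feature vectors phi(x, y) for three candidate outputs of one input x.
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [0.5, -0.3]

if __name__ == "__main__":
    print("analytic:", grad_log_prob(W, FEATURES, 0))
    print("numeric: ", numeric_grad_log_prob(W, FEATURES, 0))
```

<p>For zero weights the model is uniform over the three candidates, so the expected feature vector is (2/3, 2/3) and the gradient for the first candidate is (1/3, -2/3).</p>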
<p>In the case of neural networks, backpropagation is applied to compute <script type="math/tex">\nabla_w \pi_w(y \mid x)</script> (see for example Chapter 3 of <a href="https://arxiv.org/abs/1511.07916">Cho 2015</a>).</p>
<h2 id="lessons-learnt">Lessons Learnt</h2>
<ul>
<li>MLE sometimes cannot be applied or leads to inferior performance.</li>
<li>Instead, we leverage rewards from an external metric that evaluates the quality of our model outputs.</li>
<li>The metric might be unknown or not differentiable, so (stochastic) gradient descent cannot be applied directly.</li>
<li>The score function gradient estimator helps us side-step this problem.</li>
</ul>
<hr />
<p><strong>Acknowledgment: Thanks to Julia Kreutzer for her valuable and much needed feedback for improving this post.</strong></p>
<p><strong>Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.</strong></p>
<p><strong>Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this <a href="/statnlpgroup/bibtex/2018-11-12_score_function.bibtex">bibfile</a>.</strong></p>
Carolin
This post explains the need for the score function gradient estimator trick and how it works.