The Real Challenge of Real-World Reinforcement Learning: The Human Factor

The full potential of reinforcement learning requires reinforcement learning agents to be embedded into the flow of real-world experience, where they act, explore, and learn in our world, not just in their worlds. (Sutton & Barto, Reinforcement Learning: An Introduction, 2nd edition, 2018)

Recent well-recognized research has shown that artificial intelligence agents can achieve human-like or even superhuman performance in playing Atari games (Mnih et al. 2015) or the game of Go (Silver et al. 2016), without human supervision (Silver et al. 2017), instead using reinforcement learning techniques over many rounds of self-play. This is a huge achievement in artificial intelligence research, opening the doors for applications where supervised learning is (too) costly, and with ramifications for many application areas beyond gaming. The question arises of how to transfer the superhuman achievements of RL agents under clean-room conditions like gaming (where reward signals are well-defined and abundant) to real-world environments with all their shortcomings, first and foremost the shortcomings of human teachers (who obviously would not pass the Turing test, as indicated in the comic below).

The human factor in real-world RL for natural language processing

Let us have a look at human learning scenarios, for example, natural language translation: A human student of translation and interpretation studies has to learn how to produce correct translations from a mix of feedback types. The human teacher will provide supervision signals in the form of gold-standard translations in some cases. In most cases, however, the student has to learn from weaker teacher feedback that signals how well the student accomplished the task, without knowing what would have happened had the student produced a different translation, nor what the correct translation should look like. In addition, the best students will become like teachers in that they acquire a repertoire of strategies to self-regulate their learning process (Hattie and Timperley 2007).

Now, if our goal is to build an artificial intelligence agent that learns to translate like a human student, in interaction with a professional human translator acting as the teacher, we see the same pattern of a cost-effectiveness tradeoff: The human translator will not want to provide a supervision signal in the form of a correct translation as feedback to every translation produced by the agent, even though this signal is the most informative. Rather, in some cases weaker feedback signals on the quality of the system output, or on parts of it, are a more efficient mode of student-teacher interaction. Another scenario is that of users of online translation systems: They act as consumers - sometimes they might give a feedback signal, but rarely a fully correct translation.

We also see a similar pattern in the quality of the teacher's feedback signal when training a human and when training an agent: The human teacher of the human translation student and the professional translator acting as a human teacher of the artificial intelligence agent are both human: Their feedback signals can be ambiguous, misdirected, sparse - in short, only human (see the comic above). This is in stark contrast to the scenarios in which the success stories of RL have been written - gaming. In these environments reward signals are unambiguous, accurate, and plentiful. One might say that the RL agents playing games against humans were given the unfair advantage of an artificial environment that suits their capabilities. However, in order to replicate these success stories for RL in scenarios of learning from human feedback, we should not belittle these successes, but learn from them: The goal should be to give RL agents that learn from human feedback every possible advantage to succeed in this difficult learning scenario. For this we have to better understand what the real challenges of learning from human feedback consist of.


In contrast to previous work on learning from human reinforcement signals (see, for example, Knox and Stone, Christiano et al. 2017, Leike et al. 2018), our scenario is not one where human knowledge is used to reduce the sample complexity and thus speed up the learning process of the system, but one where no reward signals other than human feedback are available for interactive learning. This applies to many personalization scenarios where a system that is pre-trained in a supervised fashion is adapted and improved in an interactive learning setup from the feedback of the human user. Examples are online advertising or, as we will focus on here, machine translation.

Recent work (Dulac-Arnold et al. 2019) has recognized that the poorly defined realities of real-world systems are hampering the progress of real-world reinforcement learning. They address, amongst others, issues such as off-line learning, limited exploration, high-dimensional action spaces, and unspecified reward functions. These challenges are important in RL for control systems or robots grounded in the physical world, but they severely underestimate the human factor in interactive learning. We will use their paper as a foil to address several recognized challenges in real-world RL.

Counterfactual learning under deterministic logging

One of the issues addressed in Dulac-Arnold et al. 2019 is the need for off-line or off-policy RL in applications where systems cannot be updated online. Online learning is unrealistic in commercial settings due to latency requirements and the desire to test system updates offline before deployment. A natural solution would be to exploit counterfactual learning, which reuses logged interaction data where the predictions were made by a historic system different from the target system.

However, both online learning and offline learning from logged data are plagued by the problem that exploration is prohibitive in commercial systems, since it means showing inferior outputs to users. This effectively results in deterministic logging policies that lack explicit exploration, making an application of standard off-policy methods questionable. For example, techniques such as inverse propensity scoring (Rosenbaum and Rubin 1983), doubly-robust estimation (Dudik et al. 2011), and weighted importance sampling (Precup et al. 2000, Jiang and Li 2016, Thomas and Brunskill 2016) all rely on sufficient exploration of the output space by the logging system as a prerequisite for counterfactual learning. In fact, Langford et al. 2008 and Strehl et al. 2010 even give impossibility results for exploration-free counterfactual learning.
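To make the exploration requirement concrete, here is a minimal numpy sketch of inverse propensity scoring; the function name and toy numbers are mine, not from any of the cited papers:

```python
import numpy as np

def ips_estimate(rewards, target_probs, logging_probs):
    """Inverse propensity scoring (Rosenbaum & Rubin 1983): reweight
    logged rewards by the ratio of target-policy to logging-policy
    probabilities of the logged action."""
    return np.mean(rewards * target_probs / logging_probs)

# Stochastic logging: every logged action had nonzero logging
# probability, so the reweighting is well-defined.
rewards = np.array([1.0, 0.0, 1.0])
logging_probs = np.array([0.5, 0.25, 0.25])  # exploring logger
target_probs = np.array([0.8, 0.1, 0.1])
estimate = ips_estimate(rewards, target_probs, logging_probs)

# Deterministic logging: logging_probs would be all 1.0, and outputs
# the logger never showed are simply absent from the log -- no
# reweighting over the log can recover their rewards.
```

The estimator is unbiased only because the logger assigned positive probability to every output the target policy might prefer; exactly the property a deterministic commercial system lacks.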

Clearly, standard off-policy learning does not apply when commercial systems interact safely, i.e., deterministically, with human users!

So what to do? One solution is to hope for implicit exploration due to input or context variability. This has been observed for the case of online advertising (Chapelle and Li 2012) and investigated theoretically (Bastani et al. 2017). However, natural exploration is something inherent in the data, not something machine learning can optimize for.

Another solution is to consider concrete cases of degenerate behavior in estimation from deterministically logged data, and to find remedies that might repeal the impossibility theorems. One such degenerate behavior is that the empirical reward over the data log can be maximized by setting the probability of all logged data to 1, even though it is clearly undesirable to increase the probability of low-reward examples (Swaminathan and Joachims 2015, Lawrence et al. 2017a, Lawrence et al. 2017b). A solution to this problem, called deterministic propensity matching, has been presented by Lawrence and Riezler 2018a, Lawrence and Riezler 2018b and tested with real human feedback in a semantic parsing scenario. The central idea is as follows: Consider logged data $\mathcal{D} = \{(x_t, y_t, r_t)\}_{t=1}^{T}$, where $y_t$ is sampled from a logging system $\pi_0$, and the reward $r_t$ is obtained from a human user. One possible objective for off-line learning under deterministic logging is to maximize the expected reward of the logged data

$$\hat{R}_{\text{DPM}}(\pi_w) = \frac{1}{T}\sum_{t=1}^{T} r_t\, \bar{\pi}_w(y_t \mid x_t),$$

where a multiplicative control variate (Kong 1992) is used for reweighting, evaluated one-step-late at parameters $w'$ from some previous iteration (for efficient gradient calculation):

$$\bar{\pi}_w(y_t \mid x_t) = \frac{\pi_w(y_t \mid x_t)}{\frac{1}{T}\sum_{t'=1}^{T} \pi_{w'}(y_{t'} \mid x_{t'})}.$$

The effect of this self-normalization is that the probability of low-reward data can no longer be increased for free in learning: any such increase takes away probability mass from higher-reward outputs. This introduces a bias in the estimator (which decreases as $T$ increases), but it makes learning under deterministic logging feasible, thus giving the RL agent an edge in an environment where learning has been deemed impossible in the literature. See also Carolin's blog describing the semantic parsing scenario.
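Under stated simplifications (per-example probabilities precomputed as arrays, toy numbers mine), the reweighted objective can be sketched in a few lines of numpy:

```python
import numpy as np

def dpm_osl_objective(rewards, probs_w, probs_w_prev):
    """Sketch of the deterministic propensity matching objective with
    one-step-late reweighting (after Lawrence & Riezler 2018): each
    logged output's probability under the current policy pi_w is
    normalized by the average probability under the parameters w' of a
    previous iteration. Raising the probability of a low-reward output
    then necessarily lowers the normalized weight of the others."""
    reweighted = probs_w / np.mean(probs_w_prev)
    return np.mean(rewards * reweighted)

rewards = np.array([1.0, 0.2, 0.0])       # human feedback r_t
probs_w = np.array([0.4, 0.3, 0.1])       # pi_w(y_t | x_t)
probs_w_prev = np.array([0.3, 0.3, 0.2])  # pi_w'(y_t | x_t)
objective = dpm_osl_objective(rewards, probs_w, probs_w_prev)
```

Keeping the normalizer fixed at $w'$ within an iteration is what makes the gradient cheap: it is an ordinary reward-weighted likelihood gradient over the log.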

Learning reward estimators from human bandit feedback

Other issues addressed prominently in Dulac-Arnold et al. 2019 are the problems of learning from limited samples, in high-dimensional action spaces, with unspecified reward functions. This is a concise description of the learning scenario in interactive machine translation: Firstly, it is unrealistic to expect anything other than bandit feedback from a human user of a commercial machine translation system. That is, a user of a machine translation system will only provide a reward signal for the one deterministically produced best system output, and cannot be expected to rate a multitude of translations for the same input. Providers of commercial machine translation systems realize this and provide non-intrusive interfaces for user feedback that allow users to post-edit translations (a negative signal), or to copy and/or share the translation without changes (a positive signal). Furthermore, human judgements on the quality of full translations need to cover an exponential output space, while the notion of translation quality is not a well-defined function to start with: In general, every input sentence has a multitude of correct translations, each of which humans may judge differently, depending on many contextual and personal factors.
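The bandit structure of such logs can be made explicit in a tiny sketch; the event names and reward values below are hypothetical, chosen only to illustrate how non-intrusive interface signals could be turned into rewards:

```python
def bandit_reward(event: str) -> float:
    """Hypothetical mapping from non-intrusive interface events to a
    reward for the single translation that was shown. Only this one
    output is judged; no alternative translation is ever rated."""
    if event == "copy_or_share":  # user kept the translation as-is
        return 1.0
    if event == "post_edit":      # user had to correct the translation
        return 0.0
    raise ValueError(f"unknown event: {event}")

# One log entry per interaction: input, the single shown output, reward.
log_entry = {"source": "Guten Morgen",
             "translation": "Good morning",
             "reward": bandit_reward("copy_or_share")}
```

Note what is *not* in the log: no second-best translation, no rating of alternatives, and no well-defined reward function beyond the user's reaction to this one output.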

Surprisingly, the question of how to give the RL agent an advantage in learning from real-world human feedback has scarcely been researched. The suggestions in Dulac-Arnold et al. 2019 may seem straightforward - warm-starting agents to decrease sample complexity or using inverse reinforcement learning to recover reward functions from demonstrations - but they require additional supervision signals that RL was supposed to alleviate the need for. Furthermore, when it comes to the question of which type of human feedback is most beneficial for training an RL agent, one finds a lot of blanket statements referring to the advantages of pairwise comparisons for producing a scale (Thurstone 1927), however without any empirical evidence.

An exception is the work of Kreutzer et al. 2018. This work is one of the first to investigate which type of human feedback - pairwise judgements or cardinal feedback on a 5-point scale - can be given most reliably by human teachers, and which type of feedback allows learning reward estimators that best approximate human rewards and can best be integrated into an end-to-end RL task. The study collected both types of feedback on machine translations via dedicated rating interfaces for 5-point feedback and for pairwise judgements.

Contrary to common belief, inter-rater reliability, measured by Krippendorff's α, was higher for 5-point ratings than for pairwise judgements in the study of Kreutzer et al. 2018. They explain this by the possibility of standardizing cardinal judgements for each rater to remove individual biases, and by filtering out raters with low intra-rater reliability. The main problem for pairwise judgements were distinctions between similarly good or bad translations, which could be filtered out to improve intra-rater reliability, and in turn the final inter-rater reliability.

Furthermore, when training reward estimators on judgements collected for 800 translations, they measured learnability by the correlation between estimated rewards and translation edit rate against human reference translations. They found that learnability was better for a regression model trained on 5-point feedback than for a Bradley-Terry model trained on pairwise rankings (as recently used for RL from human preferences by Christiano et al. 2017).
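The two model families can be sketched by their training losses; a minimal numpy version with function names and toy values of my own choosing:

```python
import numpy as np

def regression_loss(est_rewards, ratings):
    """MSE between estimated rewards and (per-rater standardized)
    cardinal 5-point ratings."""
    return np.mean((est_rewards - ratings) ** 2)

def bradley_terry_loss(est_winner, est_loser):
    """Negative log-likelihood of observed preferences under a
    Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b))."""
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-(est_winner - est_loser)))))

# Toy batch: two rated translations, and one pair where the first
# translation was preferred over the second.
mse = regression_loss(np.array([0.5, -0.2]), np.array([1.0, -0.5]))
bt = bradley_terry_loss(np.array([0.5]), np.array([0.5]))
```

The regression loss consumes one absolute judgement per translation, while the Bradley-Terry loss only sees relative order; when the estimator assigns equal rewards to both members of a pair, the pairwise loss sits at log 2, its maximum-uncertainty value.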

Finally, and most importantly, when integrating reward estimators into an end-to-end RL task, they found that a neural machine translation system can be improved by more than 1 BLEU point using a reward estimator trained on only 800 cardinal user judgements. This is not only a promising result pointing in a direction where future research on real-world RL could happen, but it also addresses all three of the above-mentioned challenges of Dulac-Arnold et al. 2019 (limited samples, high-dimensional action spaces, unspecified reward functions) in one approach: Reward estimators can be trained on very small datasets and then be integrated as reward functions over high-dimensional action spaces. The idea is to tackle the arguably simpler problem of learning a reward estimator from human feedback first, then provide unlimited learned feedback to generalize to unseen outputs in off-policy RL.
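The division of labor can be illustrated with a deliberately tiny policy-gradient loop: a fixed, pretrained reward estimator stands in for the human and scores every sampled output. Everything here (the two-output "translation" policy, the stand-in estimator) is a toy of my own construction, not the setup of Kreutzer et al. 2018:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_estimator(y):
    """Stand-in for a reward model trained offline on a few hundred
    human judgements; here it simply prefers output 1 over output 0."""
    return float(y)

def policy_probs(theta):
    """Toy Bernoulli 'translation' policy over two candidate outputs."""
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1.0 - p1, p1])

def reinforce_step(theta, lr=0.5, n=100):
    """Score-function (REINFORCE) update of the expected *estimated*
    reward: the learned estimator supplies feedback even for sampled
    outputs no human ever rated."""
    probs = policy_probs(theta)
    ys = rng.choice(2, size=n, p=probs)
    rewards = np.array([reward_estimator(y) for y in ys])
    grads = ys - probs[1]  # d/dtheta of log pi_theta(y)
    return theta + lr * np.mean(rewards * grads)

theta = 0.0
for _ in range(200):
    theta = reinforce_step(theta)
```

After training, the policy concentrates on the output the estimator prefers; the 800 human judgements are spent once, on the estimator, and then reused indefinitely.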

Further avenues: Self-regulated interactive learning

As mentioned earlier, human students have to be able to learn in situations where the most informative learning signals are the sparsest. This is because teacher feedback comes at a cost, so that the most precious feedback of gold-standard outputs has to be requested economically. Furthermore, students have to learn how to self-regulate their learning process: when to seek help and which kind of help to seek. This is different from classic RL in games, where the cost of feedback is negligible (we can simulate games forever) - an assumption that is not realistic in the real world, where especially exploration can get very costly (and dangerous).

Learning to self-regulate is a new research direction that tries to equip an artificial intelligence agent with a decision-making ability that is traditionally hard for humans - balancing cost and effect of learning from different types of feedback, including full supervision by teacher demonstration or correction, weak supervision in the form of positive or negative rewards for student predictions, or a self-supervision signal generated by the student.

Kreutzer and Riezler 2019 have shown how to cast self-regulation as a learning-to-learn problem that makes the agent aware of the cost-reward trade-off and able to manage it. They find in simulation experiments on interactive neural machine translation that the self-regulator is a powerful alternative to uncertainty-based active learning (Settles and Craven 2008), and that it discovers an ε-greedy strategy for the optimal cost-quality trade-off by mixing different feedback types, including corrections, error markups, and self-supervision. Their simulation scenario of course abstracts away from certain confounding variables to be expected in real-life interactive machine learning; all of these, however, are interesting directions for new research on real-life RL with human teachers.
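The regulator's decision problem can be sketched as a bandit over feedback types. The cost and gain numbers below are hypothetical, and the gain-per-cost utility is my own toy stand-in for the learned regulator of Kreutzer and Riezler 2019:

```python
import random

random.seed(0)

# Hypothetical per-interaction costs and learning-progress gains,
# loosely following the feedback types discussed above.
FEEDBACK_TYPES = {
    "full_correction":  {"cost": 5.0, "gain": 4.0},
    "error_markup":     {"cost": 2.0, "gain": 1.8},
    "self_supervision": {"cost": 0.1, "gain": 0.3},
}

def choose_feedback(utilities, epsilon=0.1):
    """Epsilon-greedy regulator: mostly request the feedback type with
    the best estimated utility, occasionally explore another one. A
    learned regulator would update the utilities online from observed
    learning progress."""
    if random.random() < epsilon:
        return random.choice(list(utilities))
    return max(utilities, key=utilities.get)

# Toy utility: learning gain per unit of human cost.
utilities = {name: v["gain"] / v["cost"] for name, v in FEEDBACK_TYPES.items()}
greedy_choice = choose_feedback(utilities, epsilon=0.0)
```

With these particular numbers the greedy choice is cheap self-supervision; with a different cost-gain profile the regulator would instead pay for corrections, which is exactly the trade-off it has to learn.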

The appeal of RL from human feedback

I have tried to show that some of the challenges in real-world RL originate from the human teachers who have been considered a help in previous work (Knox and Stone, Christiano et al. 2017, Leike et al. 2018): In situations where only the feedback of a human user is available to personalize and adapt an artificial intelligence agent, the standard tricks of memorizing large amounts of labels in supervised learning, or of training in unlimited rounds of self-play with cost-free and accurate rewards in RL, won't do the job. If we want to move RL into the uncharted territories of training artificial intelligence agents from the feedback of cost-aware, unfathomable human teachers, we need to make sure the agent does not depend on massive exploration, and we have to learn great models of human feedback. It will be interesting to see how and what artificial intelligence agents learn in the same information-deprived situations that human students have to deal with, and hopefully it will lead to artificial intelligence agents that can support humans by smoothly adapting to their needs.

Acknowledgment: Thanks to Julia Kreutzer and Carolin Lawrence for our joint work and their valuable feedback on this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of his affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this BibTeX entry:

@article{riezler2019humanfactor,
    author = {Riezler, Stefan},
    title = {The Real Challenges of Real-World Reinforcement Learning: The Human Factor},
    journal = {StatNLP HD Blog},
    type = {Blog},
    number = {July},
    year = {2019},
    howpublished = {\url{}}
}