Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview

“We all need people who will give us feedback. That’s how we improve.” - Bill Gates, TED Talks Education, May 2013


We all know that supervised data is expensive to obtain. So let’s ask the following question: What if we learn from feedback given to model outputs instead?

Next to reducing the requirement for supervised data, learning from feedback also has several other advantages:

  • Even if supervised data is given, we want to also discover alternative good outputs.
  • With feedback given to model outputs, we can improve over time.
  • It is possible to personalise a system to a specific use case or user.

For these reasons, I explored how to learn from feedback for sequence-to-sequence tasks in NLP in my PhD thesis.

The scenario I assume in my thesis can be summarized with the following picture:


A pre-trained model receives an input for which it produces one or several outputs. An output is grounded in a given external world which assigns some feedback to it. The feedback is then used to update the pre-trained model.


While exploring how to learn from feedback, there are three different aspects we consider in the thesis. First, we have a final application in mind: we want to build a natural language interface to the geographical database OpenStreetMap (OSM). Second, we consider two different approaches to learn from feedback, response-based on-policy learning and counterfactual off-policy learning. Third, both approaches are applied to two different tasks, semantic parsing for question answering and machine translation.


Overall, the thesis has three parts, which we now look at in turn.

In Part 1, we set up the application of building a natural language interface to OSM. Parts 2 and 3 each look at one approach to learning from feedback. In both cases, the approach is applied to both tasks, semantic parsing for question answering and machine translation. Finally, we conclude by drawing a direct comparison between the two approaches.

Read on for the details or jump to the conclusion.

Part 1: A Natural Language Interface to OSM

Question-Answering Task

OpenStreetMap (OSM) is a geographical database of points of interest (POIs) in the world, populated by volunteers. Currently, it can only be queried with straightforward string matching methods. To find POIs with more complex relationships, such as “where is the hotel closest to the main station?”, it is necessary to issue a complicated database query. Because everyday users do not know how to issue such complex queries, we build a natural language interface to OSM. Here, users can ask natural language questions that are then automatically mapped to database queries. Executing a query against the OSM database yields the corresponding answer. To achieve the automatic mapping, we built a semantic parser that learns to transform a natural language question into a database query, in this context also called a (semantic) parse.

We first collected a manually annotated corpus, NLmaps, of 2,380 question-parse pairs. This corpus was later automatically extended into NLmaps v2, which contains 28,609 question-parse pairs.

Semantic Parsers

Either corpus allows us to train a semantic parser. For NLmaps v2, we found the best parser to be an encoder-decoder neural network with attention (based on Nematus). Additionally, named entities are handled separately in three steps. First, prior to the semantic parsing step, another neural network identifies the named entities. Second, these named entities are replaced with placeholders for the semantic parsing step. Finally, the original named entities are inserted back into the placeholders of the parse. This led to a parser with an answer-level F1 score of about 90%.
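The placeholder handling described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the placeholder format `<NE0>`, the function names, and the toy parse string are made up here and are not the actual NLmaps query syntax, and the named entities are passed in by hand rather than detected by a neural network.

```python
from typing import Dict, List, Tuple


def mask_entities(question: str, entities: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each (already detected) named entity with a numbered placeholder."""
    mapping: Dict[str, str] = {}
    masked = question
    for i, entity in enumerate(entities):
        placeholder = f"<NE{i}>"  # hypothetical placeholder format
        masked = masked.replace(entity, placeholder)
        mapping[placeholder] = entity
    return masked, mapping


def unmask_entities(parse: str, mapping: Dict[str, str]) -> str:
    """Insert the original named entities back into the placeholders of a parse."""
    for placeholder, entity in mapping.items():
        parse = parse.replace(placeholder, entity)
    return parse


# Step 1 and 2: the entity list stands in for the output of the entity tagger.
masked, mapping = mask_entities("Where are hotels in Heidelberg?", ["Heidelberg"])

# Toy stand-in for the semantic parser output (made-up query syntax).
toy_parse = "query(area(name(<NE0>)), poi(tourism(hotel)))"

# Step 3: restore the original entity in the parse.
restored = unmask_entities(toy_parse, mapping)
```

The parser only ever sees the placeholder, so it does not have to learn to copy rare place names token by token.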

With a semantic parser now available, we built a graphical interface for users to access the natural language interface to OSM. After entering a question, it is sent to the semantic parser, which produces a database query. The parse is then executed against the database and both a textual and a graphical answer are displayed for the user. For example, in the picture below a user asked about cuisines in Heidelberg. A list of the various cuisines is displayed and clicking on a cuisine opens pop-up information boxes on relevant markers on the map below.


If you want, try out your own questions!

Part 2: Response-Based On-Policy Learning

We now turn to the first approach to learn from feedback, response-based on-policy learning. The idea of response-based on-policy learning is to ground a model in a downstream task for which gold targets are available. A great advantage of this approach is that feedback can be obtained for arbitrarily many outputs.

Concretely, we employ a ramp loss:


In a ramp loss, a hope sequence is encouraged, while a fear sequence is discouraged. The specific instantiations are deferred to concrete tasks. But in general, a hope sequence has a high probability under the current model while receiving a high feedback score. In contrast, a fear sequence also obtains a high probability under the current model but receives a low feedback score.
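To make the hope and fear idea concrete, here is a minimal Python sketch over a small candidate list. The selection rules used here (probability times feedback for hope, probability times one minus feedback for fear) and the loss as a difference of probabilities are assumptions chosen for illustration; the thesis defines the exact instantiations per task.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    output: str
    prob: float      # probability under the current model
    feedback: float  # feedback score, assumed to lie in [0, 1]


def select_hope(cands: List[Candidate]) -> Candidate:
    """High model probability AND high feedback (assumed scoring rule)."""
    return max(cands, key=lambda c: c.prob * c.feedback)


def select_fear(cands: List[Candidate]) -> Candidate:
    """High model probability BUT low feedback (assumed scoring rule)."""
    return max(cands, key=lambda c: c.prob * (1.0 - c.feedback))


def ramp_loss(cands: List[Candidate]) -> float:
    """Minimising this raises the hope probability and lowers the fear probability."""
    hope, fear = select_hope(cands), select_fear(cands)
    return -(hope.prob - fear.prob)


candidates = [
    Candidate("output A", prob=0.5, feedback=0.0),
    Candidate("output B", prob=0.3, feedback=1.0),
    Candidate("output C", prob=0.2, feedback=1.0),
]
hope = select_hope(candidates)
fear = select_fear(candidates)
loss = ramp_loss(candidates)
```

In this toy example the model currently prefers the wrong output A (the fear), so the loss is positive and a gradient step would shift probability mass towards output B (the hope).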

Multilingual Semantic Parsing: NLmaps

For this task, we assume a semantic parser can transform English questions into OSM queries, but a user wants to ask questions in German. Thus, we first employ a machine translation system to translate the question from German into English. The goal is to adjust the machine translation system to work well in conjunction with the semantic parser. We use the ramp loss defined above and instantiate the feedback to be 1 if a machine-translated question ultimately leads to the correct answer and 0 otherwise. For an overview of the setup, see the picture below.
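The binary feedback signal for this setup can be sketched as a small Python function. The three callables are hypothetical stand-ins for the machine translation system, the semantic parser, and the database; the toy dictionaries and the query string are invented for the example.

```python
from typing import Callable


def downstream_feedback(
    german_question: str,
    translate: Callable[[str], str],  # hypothetical MT system, German -> English
    parse: Callable[[str], str],      # hypothetical semantic parser, English -> query
    execute: Callable[[str], str],    # hypothetical database execution, query -> answer
    gold_answer: str,
) -> int:
    """Return 1 if the translated question leads to the gold answer, else 0."""
    english_question = translate(german_question)
    query = parse(english_question)
    return 1 if execute(query) == gold_answer else 0


# Toy stand-ins so the sketch runs end to end (all values are made up).
toy_mt = {"Wie viele Hotels gibt es in Paris?": "How many hotels are there in Paris?"}
toy_parser = {"How many hotels are there in Paris?": "count(hotels, Paris)"}
toy_db = {"count(hotels, Paris)": "951"}

feedback = downstream_feedback(
    "Wie viele Hotels gibt es in Paris?",
    translate=lambda q: toy_mt.get(q, ""),
    parse=lambda q: toy_parser.get(q, ""),
    execute=lambda q: toy_db.get(q, ""),
    gold_answer="951",
)
```

Note that the translation itself is never compared to a reference translation; only the final answer is checked.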


By using the feedback signal of the downstream semantic parsing task, we can improve a linear-model machine translation system to work better in conjunction with the semantic parser. The adjusted system achieves a higher answer-level F1 score by about 8 percentage points compared to the baseline system. This is the first example that demonstrates the effectiveness of grounding a model in a downstream task.

Question-Answering: NLmaps v2

For many question-answering tasks, it is easier to obtain gold answers than gold parses. Thus, it is possible to ground semantic parsers in gold answers and treat the parses as hidden. In this scenario, we can again employ the ramp loss defined above, where the feedback for a semantic parse depends on whether the parse leads to the correct answer.

On this task, we employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. This leads to a new loss function, called Ramp+T, that performs better (for more information, see Chapter 6 of the thesis).

For our experiment, we assume an initial model has been trained on 2k supervised question-parse pairs. For the remainder of the training data, only gold answers, but not gold parses, are available. Grounding the semantic parser in the gold answers with our new loss function, Ramp+T, allows us to outperform the baseline model by over 12 percentage points in answer-level F1 score.

We have now successfully applied response-based on-policy learning to two tasks. However, this approach ultimately requires gold targets of a downstream task, which can still be too expensive to obtain. This is the case in the OSM domain: for the question “How many hotels are there in Paris?”, we cannot expect a person to count all 951 hotels in a reasonable amount of time or without error. Consequently, we next look at an approach that requires no gold targets at all.

Part 3: Counterfactual Off-Policy Learning

In the second approach to learn from feedback, counterfactual off-policy learning, we assume that a model is deployed. Users interact with the model and corresponding feedback is logged, hence the deployed model is also called the logging model. Once enough feedback is collected, the collected log can be used to improve either the logging model or any other model. With this setup, we can learn from feedback and do not require any direct or indirect gold targets. For a graphical overview see the picture below.


We update the model offline for several reasons:

  • Safety: a deployed model that is updated could degenerate without notice, leading to a bad user experience.
  • Hyperparameters: offline it is possible to do hyperparameter testing.
  • Validation: the new model can be validated on a test set before it is deployed.

While offline learning provides us with several crucial benefits, it is more challenging, because:

  • Bandit setup: feedback is only given to one output.
  • Bias: the logged output is biased towards the choice made by the logging policy.

We refer to the approach as counterfactual because we can ask the following counterfactual question: How would another model have performed if it had been in control during logging?

To employ this approach to learn from feedback, we need to collect a log with

  • the input
  • the output from the logging model
  • the feedback received from the user

Based on the log, counterfactual estimators can be defined to estimate the performance of another model. That model can then be updated via stochastic gradient descent (SGD) with a suitably set learning rate.
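As a rough sketch, the log and a single update step could look as follows in Python. The field names `x`, `y`, and `delta` and the plain list of weights are assumptions made for illustration. The step is written as ascent on an estimated reward because higher feedback is treated as better in this post; this is the same as descent on the negated estimate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LogEntry:
    x: str        # input shown to the logging model
    y: str        # output produced by the logging model
    delta: float  # feedback the user gave to that output


def sgd_step(weights: List[float], gradient: List[float], learning_rate: float) -> List[float]:
    """One gradient step that increases the estimated reward.

    `gradient` is assumed to be the gradient of a counterfactual estimator
    with respect to the weights, computed on (a minibatch of) the log.
    """
    return [w + learning_rate * g for w, g in zip(weights, gradient)]


log = [
    LogEntry(x="How many hotels are there in Paris?", y="toy parse 1", delta=1.0),
    LogEntry(x="Where can I eat in Heidelberg?", y="toy parse 2", delta=0.0),
]
new_weights = sgd_step([0.0, 1.0], gradient=[1.0, -2.0], learning_rate=0.5)
```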

In previous literature, it is assumed that outputs are sampled stochastically from the logging model. This leads to the Inverse Propensity Scoring (IPS) estimator, which can correct the bias introduced by the logging model via importance sampling:
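In its standard form, the IPS estimator can be written as follows. The notation is an assumption introduced here: a log of n entries with input x_t, logged output y_t and feedback δ_t, the logging model π_0, and the model to be evaluated π_w.

```latex
\hat{V}_{\mathrm{IPS}}(\pi_w)
  = \frac{1}{n} \sum_{t=1}^{n} \delta_t \,
    \frac{\pi_w(y_t \mid x_t)}{\pi_0(y_t \mid x_t)}
```

The ratio in each term is the importance weight: outputs that the logging model sampled rarely, but that the new model finds likely, count more.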


However, sampling is dangerous because we risk showing inferior outputs to a user, which would lead to a bad user experience. In machine translation, for example, an output sampled from the model has a high risk of not being a correct translation. For this reason, we want to always select the most likely output. This leads to deterministic logging, where the propensity of the logged output is 1 for all instances. Consequently, the importance sampling is disabled. We refer to this estimator as Deterministic Propensity Matching (DPM):
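With the propensity fixed to 1, the importance weight drops out of the estimator. Using assumed notation (a log of n entries with input x_t, logged output y_t and feedback δ_t, and the model to be evaluated π_w), this can be written as:

```latex
\hat{V}_{\mathrm{DPM}}(\pi_w)
  = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t \mid x_t)
```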


We would now like to find out if the deterministic DPM estimator can be used instead of the stochastic IPS estimator for sequence-to-sequence tasks in NLP.
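The relationship between the two estimators can be checked numerically with a small Python sketch. The record layout (feedback, probability under the new model, probability under the logging model) and the toy numbers are assumptions for illustration only.

```python
from typing import List, Tuple

# One record per logged interaction:
# (feedback, probability under the new model, probability under the logging model)
Record = Tuple[float, float, float]


def ips_estimate(log: List[Record]) -> float:
    """Inverse Propensity Scoring: reweight feedback by the importance weight."""
    return sum(delta * p_new / p_log for delta, p_new, p_log in log) / len(log)


def dpm_estimate(log: List[Record]) -> float:
    """Deterministic Propensity Matching: the logging probability is ignored."""
    return sum(delta * p_new for delta, p_new, _ in log) / len(log)


# Stochastic log: outputs were sampled, so logging probabilities are below 1.
stochastic_log = [(1.0, 0.5, 0.25), (0.0, 0.5, 0.5)]

# Deterministic log: the logged output always has propensity 1.
deterministic_log = [(1.0, 0.5, 1.0), (0.5, 0.2, 1.0)]

ips_stochastic = ips_estimate(stochastic_log)
dpm_stochastic = dpm_estimate(stochastic_log)
ips_deterministic = ips_estimate(deterministic_log)
dpm_deterministic = dpm_estimate(deterministic_log)
```

On the deterministic log both functions return the same value, which is exactly the sense in which importance sampling is disabled under deterministic logging.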

Machine Translation

To investigate whether DPM is feasible in comparison to IPS, we set up a machine translation experiment with simulated feedback. Given an out-of-domain MT system, the system translates in-domain data. To simulate feedback, we use the available gold references. This allows us to create stochastic and deterministic logs where both logs have the same feedback signal.

Both IPS and DPM suffer from high variance and can exhibit degenerative behaviour (see Chapter 7.2 in the thesis). To combat this, we add two control variates to each estimator, a multiplicative and an additive control variate (for an overview of control variates, see the great slides by Matthew W. Hoffman). This leads to the stochastic ĉDoubly Robust (ĉDR) and the deterministic ĉDoubly Controlled (ĉDC) estimator.
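To give an intuition for the multiplicative control variate (reweighting, discussed again further below), here is a minimal Python sketch of a self-normalised variant of the deterministic estimator. It is an illustration under assumptions, not the exact estimator from the thesis: the record layout and numbers are invented, and the additive control variate is left out.

```python
from typing import List, Tuple

# One record per logged interaction: (feedback, probability under the new model)
Record = Tuple[float, float]


def dpm_plain(log: List[Record]) -> float:
    """Unnormalised estimate: grows whenever any logged probability grows."""
    return sum(delta * p_new for delta, p_new in log) / len(log)


def dpm_reweighted(log: List[Record]) -> float:
    """Self-normalised estimate: divide by the sum of model probabilities."""
    numerator = sum(delta * p_new for delta, p_new in log)
    denominator = sum(p_new for _, p_new in log)
    return numerator / denominator


log = [(1.0, 0.5), (0.0, 0.25), (1.0, 0.25)]
scaled_log = [(delta, 2.0 * p_new) for delta, p_new in log]  # all probabilities doubled

plain = dpm_plain(log)
plain_scaled = dpm_plain(scaled_log)
reweighted = dpm_reweighted(log)
reweighted_scaled = dpm_reweighted(scaled_log)
```

Doubling every probability doubles the plain estimate but leaves the reweighted one unchanged. This is why the normalisation discourages the degenerate solution of simply raising the probability of every logged output, regardless of its feedback.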

We run experiments on two separate datasets and in both cases the deterministic estimator performs as well as the stochastic one. From this, we conclude that deterministic logging is viable for sequence-to-sequence NLP tasks because there is enough implicit exploration at the word level (see Chapter 7.3.4 in the thesis).

However, we still need to show that counterfactual off-policy learning is possible for sequence-to-sequence NLP tasks when the feedback is obtained from real human users. We tackle this in the next section.

Question-Answering: NLmaps v2

We noted earlier that it is difficult for some question-answering domains to obtain gold answers, e.g. in the case of the OSM domain where we, for example, can’t expect a human to count 951 hotels. As the OSM query language is relatively unknown, it is also difficult to obtain gold parses. Thus, counterfactual off-policy learning, where no gold answers are required, is particularly suitable for the OSM domain.

However, given for example the question “How many hotels are there in Paris?” and a corresponding answer, e.g. “951” or “1,003”, a human still cannot judge whether “951” or “1,003” is correct. To solve this issue, we instead propose to make the underlying parse human-understandable. We do this by automatically converting the parse into a set of statements that can easily be judged as right or wrong. You can see what this looks like for our example in the following picture:


Once the form is filled out, we can map the individual statements back to the tokens in the parse that produced them. With this approach we collected feedback for 1 question-parse pairs from 9 humans.
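Mapping judged statements back to parse tokens can be sketched as follows in Python. Everything concrete here is an assumption for illustration: the `Statement` structure, the toy token sequence, the statement texts, and the choice to give tokens that no statement covers a default feedback of 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    text: str                 # human-readable statement shown in the form
    token_indices: List[int]  # positions of the parse tokens that produced it


def token_level_feedback(
    num_tokens: int,
    statements: List[Statement],
    judgements: List[bool],
    default: float = 1.0,  # assumed feedback for tokens no statement covers
) -> List[float]:
    """Turn right/wrong judgements on statements into per-token feedback."""
    feedback = [default] * num_tokens
    for statement, is_correct in zip(statements, judgements):
        for index in statement.token_indices:
            feedback[index] = 1.0 if is_correct else 0.0
    return feedback


# Toy parse tokens and statements (made up, not the real NLmaps query language).
parse_tokens = ["query", "area", "Paris", "tourism", "hotel", "count"]
statements = [
    Statement("Town: Paris", token_indices=[1, 2]),
    Statement("Looking for: hotel", token_indices=[3, 4]),
    Statement("Answer type: count", token_indices=[5]),
]
# The user marks the second statement as wrong.
feedback = token_level_feedback(len(parse_tokens), statements, [True, False, True])
```

Only the tokens behind the statement that was marked wrong receive negative feedback, while the rest of the parse is still rewarded.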

For this task, we again employ a neural model. Because neural models produce their output token by token, we can assign feedback at the token level. That is particularly ideal for our situation because the feedback form already collects feedback at a token level. This leads to the new objective, called DPM+T.

The DPM+T objective does not employ a control variate, but we would like to do so to reduce variance. The multiplicative control variate, reweighting (Swaminathan and Joachims, 2015), we used previously is not applicable to stochastic minibatch learning. To be applicable, we modify this control variate, leading to a new control variate that we refer to as One-Step-Late reweighting (OSL). Together with the previous new objective, this leads to the combined objective, DPM+T+OSL (for more information, see Chapter 8 of the thesis). DPM+T+OSL is the best objective for both learning from the 1 human feedback instances as well as learning from a larger, but simulated log of 22 feedback instances.

Comparison of both learning approaches

Because we employ the same NLmaps task and the same neural network architecture for both approaches to learn from feedback, we can directly compare the two approaches.

Unsurprisingly, response-based learning outperforms counterfactual learning significantly because it has a better learning signal available. Because response-based learning has a downstream gold target at hand, it can obtain feedback for arbitrarily many model outputs. Counterfactual learning instead only has access to one model output and its feedback. Furthermore, that model output is biased by the logging policy.

Ultimately, the choice between response-based and counterfactual learning reduces to how expensive it is to obtain gold targets. For example, for the OSM domain, it is impractical to obtain gold parses as well as gold answers because the parses can only be written by a handful of people and the answers are too cumbersome for humans to derive. In such a situation, obtaining feedback on model outputs from human users is a viable alternative.

If the base model is good enough, this feedback can directly be collected while real users are interacting with the system. Otherwise, another option would be to recruit human workers to provide the needed feedback.

So in conclusion: counterfactual learning should be chosen if gold targets are impossible, too time consuming or too expensive to obtain, whereas feedback for model outputs can be collected easily. Otherwise, response-based learning is the better approach because the available gold targets offer a stronger learning signal. For an overview of this, also see the following diagram:



Conclusion

It is a good idea to explore how to learn from feedback given to model outputs for several reasons, the primary one being that the collection of direct gold targets might be too expensive.

In my thesis, I explored two separate approaches to learn from feedback, response-based and counterfactual learning. Response-based learning assumes that indirect gold targets are available. Counterfactual learning does not require gold targets and instead saves feedback given by humans interacting with a deployed system in a log.

If (indirect) gold targets can be obtained, response-based learning is the more promising approach because the gold targets offer a stronger learning signal. However, for situations where it is not possible to collect direct or indirect gold targets, counterfactual learning offers a viable alternative.

Next to exploring how to learn from feedback, it was important to me during my PhD project to keep a concrete user application in mind. To this end, I developed a natural language interface to OpenStreetMap (OSM).

My PhD project was a long, but very rewarding journey. I learnt so much and got to join a great NLP community. Special thanks go to my supervisor, Stefan Riezler, who always encouraged my ideas and guided me to the path that led to my thesis. I also want to thank all my colleagues who were always willing to listen and offer suggestions.

If you enjoyed this post and want to discuss anything further, feel free to reach out to me via e-mail or twitter.

More information can be found in the thesis.

Acknowledgment: Thanks to Stefan Riezler and Mayumi Ohta for their valuable feedback to improve this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.

If you want to cite this blogpost, use this bib entry:

    author = {Lawrence, Carolin},
    title = {Response-Based and Counterfactual Learning for Sequence-to-Sequence Tasks in NLP: An Overview},
    journal = {StatNLP HD Blog},
    type = {Blog},
    number = {August},
    year = {2019},
    howpublished = {\url{}}