Counterfactual Learning of Semantic Parsers When Even Gold Answers Are Unattainable

In semantic parsing, natural language questions are mapped to semantic parses. A semantic parse can be executed against a database to obtain an answer. This answer can then be presented to a user.

Semantic parsers for question-answering can be employed in virtual personal assistants, which have become increasingly common in recent years. Because such assistants are expected to help with a growing number of tasks, we need to explore how to efficiently and effectively set up a parser for a new domain, adapt it to specific user needs and ensure that it keeps improving.

However, obtaining labelled data can be challenging. In this post, we first consider the different possible supervision signals that can be used to train a semantic parser. This influences which objectives can be used for training, which we explore in the second part.

Supervision Signal

Question-Parse Pairs

To train a semantic parser, direct supervision means collecting question-parse pairs. This can be difficult if the parse language is only understood by expert users. One option is to choose a parse language that is as widely known as possible, e.g. SQL (Iyer et al., 2017). However, even in the case of SQL, experts are required for the annotation, which can quickly become very expensive.

Question-Answer Pairs

An alternative option is to employ a weaker supervision signal. Collecting question-answer pairs is easier for many domains (Berant et al., 2013; Iyyer et al., 2014; Yang et al., 2015; inter alia) and can typically be done by non-experts.

However, the weaker supervision signal from question-answer pairs presents a harder learning task. While the gold answer is known, it remains unclear which parse will lead to the gold answer. During training, the parser has to explore the output space to find a parse that executes to the gold answer. This search can be difficult because the output space is large. Furthermore, instead of finding a parse that represents the correct meaning of the question, one might find a spurious parse. Such a parse happens to execute to the gold answer, but conveys the wrong meaning. This hampers generalisation.

For example, assume we have the question “Are there any bars?” and instead of mapping “bar” to the logical form for “bar”, the parser maps it to the logical form for “restaurant”. If the answer to both “Are there any bars?” and “Are there any restaurants?” is “Yes”, then the wrong logical form “restaurant” for the question “Are there any bars?” will lead to the correct answer. The parser has now wrongly learnt to map “bar” to the logical form “restaurant”, and for other questions, such as “Where is the closest bar?”, it will return the closest restaurant instead.
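The spurious-parse problem can be made concrete with a small sketch. The toy database, the place names and the two executor functions below are made up for illustration and are not part of any real parser or of OpenStreetMap.

```python
# Toy database; names, categories and distances are invented for illustration.
DATABASE = [
    {"name": "Blue Door", "type": "bar", "distance_km": 2.0},
    {"name": "Trattoria", "type": "restaurant", "distance_km": 0.5},
]

def exists(kind):
    """Execute a hypothetical 'are there any <kind>?' parse."""
    return any(row["type"] == kind for row in DATABASE)

def closest(kind):
    """Execute a hypothetical 'where is the closest <kind>?' parse."""
    rows = [row for row in DATABASE if row["type"] == kind]
    return min(rows, key=lambda row: row["distance_km"])["name"]

# Question: "Are there any bars?" with gold answer True.
gold_answer = True
correct_parse_answer = exists("bar")
spurious_parse_answer = exists("restaurant")  # wrong meaning, same answer

# Under answer-only supervision, both parses receive the same reward.
both_rewarded = (correct_parse_answer == gold_answer
                 and spurious_parse_answer == gold_answer)

# The spurious mapping "bar" -> restaurant breaks on a different question.
closest_with_correct_mapping = closest("bar")
closest_with_spurious_mapping = closest("restaurant")
```

The answer alone cannot separate the two parses for the first question, but the second question exposes the wrong mapping.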

Comparison: Question-Parse vs. Question-Answer Pairs

Yih et al. (2016) investigated the cost and benefit of obtaining question-parse pairs compared to collecting question-answer pairs. For this, they used the WebQuestions corpus (Berant et al., 2013), which is based on the Freebase database. The corpus was originally collected with the help of non-expert crowd-source workers in the form of question-answer pairs. Yih et al. (2016) annotated each question in the corpus with a corresponding gold parse. To ease the annotation, they designed a simple user interface and hired experts familiar with Freebase.

Next, they compared a system trained on question-parse pairs to a system trained on question-answer pairs. In their experiments, they were able to show three, in part surprising, results:

  1. The model trained on question-parse pairs outperforms the model trained on question-answer pairs by over 5 percentage points in answer accuracy.

  2. Answer annotation by crowd-source workers is often incorrect; in their evaluation it was incorrect 34% of the time.

  3. With an easy to use interface, experts can write the correct semantic parse faster than they can retrieve the correct answer.

Observation 1. does not come as a surprise as question-parse pairs offer a stronger learning signal. But both 2. and 3. are surprising. However, as noted previously, hiring experts to annotate gold parses can be expensive.

A further problem arises for domains where it is not easy to collect gold answers, for example when answers are open-ended lists, fuzzily defined or very large.

This is the case, for example, in the domain of geographical question-answering using the OpenStreetMap database. Here, the underlying parse language is only known to a few expert users, which makes the collection of gold parses particularly difficult. Furthermore, it is often impossible to collect gold answers because in many cases the gold answer set is too large or too fuzzily defined (e.g. when searching for objects “near” another one) to be obtained in a reasonable amount of time or without error.

Question-Feedback Pairs

In cases where the collection of both gold parses and gold answers is infeasible, we need to obtain a learning signal from other sources. One option is to obtain feedback from users while they are interacting with the system (Lawrence & Riezler, 2018).

For this, a baseline semantic parser is trained on a small number of question-parse pairs. This parser can be used to parse further questions for which neither gold parses nor gold answers exist. The parse suggested by the baseline can then be automatically transformed into a set of human-understandable statements. Human users can easily judge each statement as correct or incorrect, and this feedback can be used to further improve the parser.

For example, below is a question and the statements automatically generated from the corresponding parse.

With the filled in form, we know which parts of the parse are wrong.

This allows us to go further than just promoting correct parses. For each statement, we are able to map it back to the tokens in the parse that produced it. This allows us to learn from partially correct parses, where we only promote the tokens associated with correct statements.
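The mapping from statement-level feedback to token-level rewards can be sketched as follows. The parse tokens, the statements and the rule for combining judgements are hypothetical; in particular, the sketch assumes that a token is promoted only if it belongs to at least one statement judged correct and to no statement judged incorrect.

```python
def token_rewards(num_tokens, statements):
    """Return one reward per parse token.

    `statements` is a list of (token_indices, judged_correct) pairs.
    Assumption of this sketch: a token gets reward 1.0 only if it belongs
    to at least one statement judged correct and to none judged incorrect.
    """
    supported = set()
    rejected = set()
    for token_indices, judged_correct in statements:
        if judged_correct:
            supported.update(token_indices)
        else:
            rejected.update(token_indices)
    return [1.0 if i in supported and i not in rejected else 0.0
            for i in range(num_tokens)]

# Hypothetical parse tokens for "Where is the closest bar?"
parse_tokens = ["query(", "nearest(", "amenity=restaurant", ")", ")"]
feedback = [
    ([0, 1, 3, 4], True),  # e.g. "You want the closest match": correct
    ([2], False),          # e.g. "You are looking for a restaurant": incorrect
]
rewards = token_rewards(len(parse_tokens), feedback)
```

Only the token tied to the rejected statement receives no reward, so the rest of a partially correct parse can still be promoted.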


The collected data decides which objectives can be applied during training. Below we give an overview of various objectives, which data they require and what their advantages and disadvantages are.

First off, here is some general notation:

  • $\pi_w$: neural network with parameters $w$
  • $x$: input question
  • $y$: output parse
  • $\bar{y}$: gold parse
  • $\bar{a}$: gold answer

We define an objective in terms of a loss function $\mathcal{L}$. For training, we differentiate it with respect to the model’s parameters $w$ to make (stochastic) gradient descent updates, $w \leftarrow w - \eta \nabla_w \mathcal{L}$, where $\eta$ is a suitable learning rate.

Question-Parse Pairs: Maximum Likelihood Estimation (MLE)

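Neural networks are typically trained using MLE, where the probability of a gold parse $\bar{y}$ is raised given a question $x$ (e.g. Dong & Lapata, 2016 or Jia & Liang, 2016). For $n$ question-parse pairs $(x_t, \bar{y}_t)$, the objective is defined as follows:

$$\mathcal{L}_{\text{MLE}} = -\frac{1}{n} \sum_{t=1}^{n} \log \pi_w(\bar{y}_t \mid x_t)$$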
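As a minimal sketch of the MLE idea, the loss can be computed as the average negative log-probability that the model assigns to the gold parses. The function name and the example probabilities are illustrative assumptions; a real parser would obtain these probabilities from the network.

```python
import math

def mle_loss(gold_parse_probs):
    """Average negative log-likelihood of the gold parses.

    `gold_parse_probs[t]` is the model probability of the gold parse for
    the t-th question (values are made up in the example below).
    """
    n = len(gold_parse_probs)
    return -sum(math.log(p) for p in gold_parse_probs) / n

# Raising the probability of the gold parses lowers the loss.
loss_before = mle_loss([0.2, 0.5, 0.1])
loss_after = mle_loss([0.6, 0.8, 0.7])
```

The loss reaches zero only when every gold parse has probability one.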


However, this approach is only possible if gold targets are available. As mentioned in the first section, obtaining these might be too expensive in practice, and weaker supervision signals are the practical alternative.

There is a further reason for a different objective, even when question-parse pairs are available:

There might be other parses, not just the annotated gold parse, that lead to the correct answer. But these can never be discovered if the MLE objective is used. Discovering further valid parses could stabilise learning and help generalisation. Further, this allows the parser to find suitable parses in its own output space.

Next, we turn to objectives which assume the existence of gold answers, obtained either by executing gold parses or by direct annotation. For these objectives, a parser produces model outputs which are executed to obtain corresponding answers. Each answer can be compared to the available gold answer and a reward can be assigned to the corresponding model output.

Question-Answer Pairs: REINFORCE and Minimum Risk Training (MRT)

Recently, reinforcement learning approaches, in particular the REINFORCE algorithm (Williams, 1992), have become popular for (weakly) supervised NLP tasks. The issues that arise in this context are also explored with regard to neural machine translation in another blog post.

We will first introduce the REINFORCE algorithm, then discuss potential issues.

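In REINFORCE, given an input question $x$, one output $y$ is sampled from the current model distribution $\pi_w(y \mid x)$ (see Section 13.3 of Sutton & Barto, 2018). Executing this sampled parse yields an answer $a$, and comparing it with the gold answer $\bar{a}$ provides us with a reward $\delta(y)$. On the basis of this single reward, the model’s parameters are updated, i.e. we can define the following objective, whose gradient is estimated from the one sampled output $y_t$ per question $x_t$:

$$\mathcal{L}_{\text{REINFORCE}} = -\frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_{y \sim \pi_w(\cdot \mid x_t)}\left[\delta(y)\right], \qquad \nabla_w \mathcal{L}_{\text{REINFORCE}} \approx -\frac{1}{n} \sum_{t=1}^{n} \delta(y_t)\, \nabla_w \log \pi_w(y_t \mid x_t)$$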

However, with just one sample, this objective can suffer from high variance. This can be combated by introducing control variates, which lower variance. The most popular choice is a baseline, where we keep track of the average reward and subtract it from the reward $\delta(y)$ of the sampled output.
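A single-sample REINFORCE update with an average-reward baseline might look roughly like the sketch below. It uses a toy categorical policy over three candidate parses rather than a neural parser; the function names, the learning rate and the rewards are assumptions made for illustration.

```python
import math
import random

def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, rewards, baseline, lr, rng):
    """One single-sample REINFORCE update on a toy categorical policy.

    `logits` parameterise a distribution over candidate parses and
    `rewards[i]` is the reward of candidate i. Returns the updated logits
    and the reward observed for the sampled candidate.
    """
    probs = softmax(logits)
    sampled = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    advantage = rewards[sampled] - baseline
    # d/d logit_i of log pi(sampled) equals 1[i == sampled] - probs[i].
    # Ascending the reward is the same as descending the negative reward.
    new_logits = [
        logit + lr * advantage * ((1.0 if i == sampled else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]
    return new_logits, rewards[sampled]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 1.0, 0.0]  # only candidate 1 executes to the gold answer
baseline, seen = 0.0, 0
for _ in range(500):
    logits, reward = reinforce_step(logits, rewards, baseline, lr=0.5, rng=rng)
    seen += 1
    baseline += (reward - baseline) / seen  # running average reward
final_probs = softmax(logits)
```

Subtracting the baseline leaves the expected gradient unchanged but centres the rewards, so that outputs with below-average reward are actively demoted.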

But why only sample one output?

We have the luxury of having gold answers available.

This allows us to sample several outputs and obtain rewards for all of them. With this, an average can be computed, and updates to the parameters $w$ are performed on the basis of this average. First, this lowers the variance. Second, it allows us to try out several model outputs, which helps us to explore the output space and in turn increases our chance of finding a parse that leads to the correct answer.
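The variance reduction from averaging can be checked empirically with a small simulation. The sketch assumes that every sampled parse independently executes to the gold answer with a fixed probability, which is a simplification chosen for illustration.

```python
import random
import statistics

def mean_reward_estimate(p_correct, num_samples, rng):
    """Average 0/1 reward over `num_samples` sampled parses, where each
    sample executes to the gold answer with probability `p_correct`."""
    hits = sum(1 for _ in range(num_samples) if rng.random() < p_correct)
    return hits / num_samples

rng = random.Random(0)
# Repeat the estimate many times to measure its spread.
single = [mean_reward_estimate(0.3, 1, rng) for _ in range(2000)]
averaged = [mean_reward_estimate(0.3, 10, rng) for _ in range(2000)]
var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
```

With independent samples, the variance of the averaged estimate shrinks roughly in proportion to the number of samples, at the cost of executing that many parses per question.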

Building an average based on several outputs obtained for one input is exactly the characteristic idea of Minimum Risk Training (MRT).

MRT was introduced in the context of log-linear models for dependency parsing and machine translation (Smith & Eisner, 2006). It has also been tested for neural models in the context of machine translation (Shen et al., 2016).

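Sampling $S$ outputs $y_{t,1}, \dots, y_{t,S}$ per input $x_t$, we can define the following MRT objective, written here in the form of Shen et al. (2016) with the sharpness parameter omitted, where the model probabilities are renormalised over the sampled outputs:

$$\mathcal{L}_{\text{MRT}} = -\frac{1}{n} \sum_{t=1}^{n} \sum_{s=1}^{S} \delta(y_{t,s})\, \frac{\pi_w(y_{t,s} \mid x_t)}{\sum_{s'=1}^{S} \pi_w(y_{t,s'} \mid x_t)}$$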

This objective is, for example, employed in Liang et al. (2017). Although they use the term REINFORCE (“We apply REINFORCE”), their later objective is based upon $k$ outputs (“Thus, in contrast with common practice of approximating the gradient by sampling from the model, we use the top-$k$ action sequences”), which is reminiscent of MRT. Similarly, Guu et al. (2017) also calculate an average over several output samples for one input (see their Equation 9). Mou et al. (2017) also take advantage of the gold answers to sample and evaluate several parses for one input (“We adjust the reward by subtracting the mean reward, averaged over sampled actions for a certain data point.”).

By sampling several outputs per input, MRT exhibits lower variance than REINFORCE. However, it can only be applied if question-answer pairs are available, and it is more expensive to compute.
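For a single question, an MRT-style loss over several sampled parses can be sketched as below. The sketch follows the renormalised form of Shen et al. (2016) with the sharpness parameter fixed to 1, which is an assumption; the probabilities and rewards are made-up values.

```python
def mrt_loss(sample_probs, sample_rewards):
    """Negative expected reward over the sampled parses of one question.

    `sample_probs[s]` is the model probability of the s-th sampled parse and
    `sample_rewards[s]` its reward. Probabilities are renormalised over the
    sample (sharpness parameter fixed to 1, an assumption of this sketch).
    """
    total = sum(sample_probs)
    return -sum(p / total * r for p, r in zip(sample_probs, sample_rewards))

# Three sampled parses; the last two execute to the gold answer.
rewards = [0.0, 1.0, 1.0]
loss_before = mrt_loss([0.5, 0.3, 0.1], rewards)
# Shifting probability mass towards rewarded parses lowers the loss.
loss_after = mrt_loss([0.2, 0.6, 0.1], rewards)
```

Each evaluation needs one execution per sampled parse, which is where the extra computational cost comes from.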

For our final scenario from the previous section, where neither gold answers nor gold parses are available and we only have feedback collected for one model output, we are limited to only one sample and MRT cannot be applied.

Let’s see which objectives we can apply in such scenarios.

Question-Feedback Pairs: REINFORCE and Counterfactual Learning

A setup where only one output and its corresponding feedback are available is also called a bandit learning scenario. The name is inspired by the problem of choosing one among several slot machines (colloquially referred to as “one-armed bandits”), where we only observe the reward for the chosen machine (i.e. output) and it remains unknown what reward the other machines (or outputs) would have obtained.

This is a crucial contrast to learning from question-answer pairs. We illustrate this graphically in the figure below. The left side shows the scenario where question-answer pairs are available, whereas the right assumes question-feedback pairs where no gold answers are available.


REINFORCE is still applicable in bandit learning scenarios. But if we collect feedback as users are using the system, it can be dangerous to update the parser’s parameters online.

The parser’s performance could deteriorate without notice, which can lead to user dissatisfaction and monetary loss. Online updates also make it impossible to explore different hyperparameter settings.

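Instead, it is safer to first collect the feedback in a log of triples $\mathcal{D} = \{(x_t, y_t, \delta_t)\}_{t=1}^{n}$, each consisting of a question $x_t$, the parse $y_t$ shown to the user and the feedback $\delta_t$ received for it. Once enough feedback has been collected, the log can be used to further improve the parser offline. The resulting model can then be validated against additional test sets before it is deployed.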

However, once we start learning, the outputs recorded in the log might no longer be the outputs the updated parser would choose; i.e. the log we collected is biased towards the parser that was deployed at the time. Learning from such a log leads to a counterfactual, or off-policy, learning setup.

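The bias in the log can be corrected using importance sampling, where we divide the probability $\pi_w(y_t \mid x_t)$ that the new model assigns to the logged output by the probability $\mu(y_t \mid x_t)$ that the deployed parser assigned to that output. With $\delta_t$ denoting the feedback logged for output $y_t$, this leads to the following Inverse Propensity Score (IPS) objective:

$$\mathcal{L}_{\text{IPS}} = -\frac{1}{n} \sum_{t=1}^{n} \delta_t\, \frac{\pi_w(y_t \mid x_t)}{\mu(y_t \mid x_t)}$$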
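As a rough sketch, the IPS loss over a logged dataset can be computed as follows. The sketch assumes that each log entry also stores the probability the deployed parser assigned to the output it showed; the log contents and the `new_policy_prob` callable are hypothetical.

```python
def ips_loss(log, new_policy_prob):
    """Negative IPS estimate of the reward of a new policy.

    `log` holds (question, parse, reward, logging_prob) entries, where
    `logging_prob` is the probability the deployed parser gave to the parse
    it showed. `new_policy_prob(question, parse)` is a hypothetical callable
    returning the new model's probability for the logged parse.
    """
    total = 0.0
    for question, parse, reward, logging_prob in log:
        total += reward * new_policy_prob(question, parse) / logging_prob
    return -total / len(log)

# Made-up log with three entries.
log = [
    ("q1", "parse_a", 1.0, 0.5),
    ("q2", "parse_b", 0.0, 0.8),
    ("q3", "parse_c", 1.0, 0.25),
]
new_probs = {
    ("q1", "parse_a"): 0.7,
    ("q2", "parse_b"): 0.1,
    ("q3", "parse_c"): 0.5,
}
loss = ips_loss(log, lambda q, y: new_probs[(q, y)])
```

Outputs that the deployed parser chose with low probability receive a larger weight, which compensates for their under-representation in the log.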

However, because we ideally want to present only correct parses and answers to our users, we want to always show the most likely output under the currently deployed model. This results in a deterministic log because the probability of choosing the most likely output is always one. Consequently, we can no longer correct the data bias.

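With all logging probabilities equal to one, the importance weights disappear. This leads to the Deterministic Propensity Matching (DPM) objective, where $\delta_t$ is the feedback logged for output $y_t$ given question $x_t$:

$$\mathcal{L}_{\text{DPM}} = -\frac{1}{n} \sum_{t=1}^{n} \delta_t\, \pi_w(y_t \mid x_t)$$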
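A minimal sketch of the DPM loss on a deterministic log is given below. It omits any control variate; the log entries and the `new_policy_prob` callable are hypothetical.

```python
def dpm_loss(log, new_policy_prob):
    """Negative DPM estimate over a deterministic log.

    `log` holds (question, parse, reward) triples. Because the deployed
    parser always showed its most likely parse, no logging probability is
    stored. `new_policy_prob(question, parse)` is a hypothetical callable.
    """
    total = sum(reward * new_policy_prob(question, parse)
                for question, parse, reward in log)
    return -total / len(log)

# Made-up log: the first parse received positive feedback, the second did not.
log = [("q1", "parse_a", 1.0), ("q2", "parse_b", 0.0)]
probs_before = {("q1", "parse_a"): 0.3, ("q2", "parse_b"): 0.6}
probs_after = {("q1", "parse_a"): 0.9, ("q2", "parse_b"): 0.1}
loss_before = dpm_loss(log, lambda q, y: probs_before[(q, y)])
loss_after = dpm_loss(log, lambda q, y: probs_after[(q, y)])
```

Raising the probability of logged parses with positive feedback lowers the loss, while parses with zero reward do not contribute to it.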

Just like REINFORCE, both IPS and DPM suffer from high variance because only one output received a reward for each input. It is thus advisable to employ control variates, e.g. one-step-late reweighting (Lawrence & Riezler, 2018).

With these objectives it is possible to learn from question-feedback pairs where feedback was collected from users for one system output. This approach is useful for scenarios where neither the collection of gold parses nor the collection of gold answers is feasible.

Furthermore, this approach can also be applied to other tasks, such as machine translation (Lawrence et al., 2017; Kreutzer et al., 2018a; Kreutzer et al., 2018b).


  • Semantic parsers are important modules in virtual personal assistants. As these assistants are used in an increasing number of domains, we need to find efficient and effective methods to train parsers on new domains.
  • In general, the stronger the learning signal, the better the result. For each new domain, we should estimate the time and cost of the different approaches, while keeping in mind that: question-parse pairs > question-answer pairs > question-feedback pairs.
  • If it is too expensive to obtain gold parses or gold answers, then using counterfactual learning from question-feedback pairs is a viable alternative.

Acknowledgment: Thanks to Julia Kreutzer and Stefan Riezler for their valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not any of her affiliated organizations and makes no claim or warranties as to completeness, accuracy and up-to-dateness.
