Taming Wild Reward Functions: The Score Function Gradient Estimator Trick

Maximum Likelihood Estimation (MLE) is often not enough to train sequence-to-sequence neural networks in NLP. Instead, we employ an external metric: a reward function that helps us judge model outputs. The parameters of the network are then updated on the basis of the model outputs and their corresponding rewards.

For this update, it is necessary to obtain a derivative.

But how can we do this if the external function is unknown or cannot be differentiated?

Enter: The score function gradient estimator trick.

Why MLE is not Enough

Traditionally, neural networks are trained using Maximum Likelihood Estimation (MLE): given an input sequence $x = x_1, x_2, \dots x_{ \mid x \mid }$ and a corresponding gold target sequence $\bar{y} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{ \mid \bar{y} \mid }$ , we want to increase the probability that the current model $\pi$ with parameters $w$ assigns to the pair $(x,\bar{y})$ . This gives the following loss function:

$\mathcal{L}_{MLE} = - \sum_{j=1}^{ \mid \bar{y} \mid } \log \pi_w(\bar{y}_{j} \mid \bar{y}_{<j}, x),$

where $\bar{y}_{<j} = \bar{y}_1, \bar{y}_2, \dots \bar{y}_{j-1}.$

The parameters $w$ of $\pi$ are then updated using stochastic gradient descent with learning rate $\eta$ ,

$w = w - \eta \nabla_w \mathcal{L}_{MLE}.$
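To make the loss and the update concrete, here is a minimal Python sketch. It assumes a deliberately tiny stand-in model that ignores both $x$ and the target prefix and simply applies a softmax to one logit per vocabulary item. The names `softmax`, `mle_loss_and_grad` and `sgd_step` are illustrative and not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def mle_loss_and_grad(w, gold):
    """Toy model (assumption): pi_w(y_j | y_<j, x) = softmax(w)[y_j].

    Returns the negative log-likelihood of the gold token ids and its
    gradient with respect to the logits w.
    """
    probs = softmax(w)
    loss = -sum(math.log(probs[t]) for t in gold)
    # d/dw_k of -log softmax(w)[t] is probs[k] - 1[k == t], summed over targets.
    grad = [len(gold) * p for p in probs]
    for t in gold:
        grad[t] -= 1.0
    return loss, grad

def sgd_step(w, grad, eta):
    """One gradient descent update: w = w - eta * grad."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

w = [0.0, 0.0, 0.0]   # one logit per vocabulary item
gold = [2, 0, 2]      # gold target sequence as token ids
loss_before, grad = mle_loss_and_grad(w, gold)
w = sgd_step(w, grad, eta=0.1)
loss_after, _ = mle_loss_and_grad(w, gold)
print(loss_before, loss_after)
```

After one small step the loss on the gold sequence is lower than before, which is all the update is meant to achieve.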

But there are various issues with using MLE that have led researchers to explore alternative objectives. Let’s look at them next.

1. Gold targets $\bar{y}$ are not Available

This is most prominently the case in many domains of semantic parsing for question-answering, where questions $x$ are mapped to a semantic parse $y$ , which can be executed to obtain an answer $a$ . For many domains, it is easier to collect question-answer pairs, rather than question-parse pairs (e.g. see Berant et al. 2013). But with no gold parses available, MLE cannot be applied.

What can we do instead?

The current model produces a set of likely parses (e.g. by sampling from the model distribution or by employing beam search). Each parse is then executed to obtain an answer. Next, we compare the answer to the gold answer to get a reward $\delta$ . Generally, we have $\delta=0$ if there is no overlap between answer and gold answer and $\delta=1$ if they match exactly. With this, we can update the model’s parameters.
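The reward computation described above could look roughly as follows. The knowledge base `KB`, the `execute` function and the parse strings are hypothetical placeholders for a real semantic parser and executor.

```python
def exact_match_reward(answer, gold_answer):
    """Reward delta: 1 if the executed answer matches the gold answer, else 0."""
    return 1.0 if answer == gold_answer else 0.0

# Hypothetical toy knowledge base; a real system would query a database.
KB = {"capital(France)": "Paris", "capital(Italy)": "Rome"}

def execute(parse):
    """Execute a parse against the toy knowledge base (None if it fails)."""
    return KB.get(parse)

# Candidate parses, e.g. obtained by sampling or beam search.
candidates = ["capital(Italy)", "capital(France)", "capital(Spain)"]
gold_answer = "Paris"

rewards = [exact_match_reward(execute(p), gold_answer) for p in candidates]
print(rewards)
```

Only the question and the gold answer are needed here; no gold parse is ever consulted.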

2. Exposure Bias: Ranzato et al. 2016

During traditional MLE training the model is fed the perfect tokens from the available gold target $\bar{y}$ , but at test time the output sequence is produced on the basis of the model distribution. This causes a distribution mismatch and inferior performance.

How can we reduce this mismatch?

We can feed the model its own output sequences already at training time. Typically, once an entire output sequence has been produced, this sequence is judged by an external metric and the resulting reward $\delta$ can be used as feedback to update the model’s parameters.

3. Loss-Evaluation Mismatch: Wiseman & Rush 2016

MLE is agnostic to the final evaluation metric. Ideally we would like to have the final evaluation metric in the objective used at training time, so that the parameters of the model are specifically tuned to perform well on the intended task.

How can we do that?

Similar to problem (2.), we can feed model output sequences at training time. In this case the external metric is the final evaluation metric. For example, in the case of machine translation, typically a per-sentence approximation of the BLEU score is used.

Maximise the Expected Reward Obtained for Model Outputs

To solve all three problems, we can instead maximise the expected reward $\delta$ or, equivalently, minimise the expected risk $-\delta$ . This can be formulated as the following expectation:

$\mathcal{L}_\delta = \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta],$

where $p(x)$ is the probability distribution over inputs $x$ and $\pi_w(y \mid x)$ is the probability distribution over outputs $y$ given $x$ .

In practice, this expectation has to be approximated. For example, using Monte-Carlo sampling leads to the REINFORCE algorithm (Williams 1992): we sample one output $y$ from the model distribution $\pi_w(y \mid x)$ (see also Chapter 13 of Sutton & Barto 2018). For this single sample, the expectation over $y$ is approximated by $-\delta$ , and the actual training objective becomes the surrogate

$\mathcal{L}_{REINFORCE} = - \delta \log \pi_w(y \mid x), \quad y \sim \pi_w(\cdot \mid x).$

The goal of this objective is to increase the probability of an output proportionally to its reward. With $\delta$ treated as a constant, the gradient of the REINFORCE objective, $-\delta \nabla_w \log \pi_w(y \mid x)$ , is an unbiased estimate of the gradient of the $\mathcal{L}_\delta$ objective.

Alternatively, we can use Minimum Risk Training (MRT) (Smith & Eisner 2006, Shen et al. 2016). Here, several outputs are sampled from the model distribution. This stabilises learning, but requires that more outputs are evaluated to get corresponding rewards. Assuming $S$ sampled outputs $y_1, \dots, y_S$ with rewards $\delta_1, \dots, \delta_S$ , the objective then takes the following form:

$\mathcal{L}_{MRT} = - \frac{1}{S} \sum_{s=1}^{S} \delta_s \log \pi_w(y_s \mid x), \quad y_s \sim \pi_w(\cdot \mid x).$
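The sampling step can be illustrated on a toy problem where the expectation over $y$ is small enough to compute exactly. The sketch below assumes a single input $x$ and three candidate outputs with fixed probabilities and made-up rewards. It compares the exact expected risk with an estimate from one sample and an estimate from many samples.

```python
import random

def sample_output(probs, rng):
    """Draw one output index y ~ pi_w(. | x)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def mc_expected_risk(probs, reward_fn, num_samples, rng):
    """Monte Carlo estimate of E_{pi_w(y|x)}[-delta] for a single input x."""
    total = 0.0
    for _ in range(num_samples):
        y = sample_output(probs, rng)
        total += -reward_fn(y)  # the reward function is only ever queried
    return total / num_samples

# Hypothetical toy setup: output probabilities and rewards are made up.
probs = [0.2, 0.5, 0.3]
rewards = [0.0, 1.0, 0.5]

def reward_fn(y):
    return rewards[y]

exact = -sum(p * r for p, r in zip(probs, rewards))
rng = random.Random(0)
one_sample = mc_expected_risk(probs, reward_fn, 1, rng)
many_samples = mc_expected_risk(probs, reward_fn, 20000, rng)
print(exact, one_sample, many_samples)
```

A single sample can only return one of the three possible risks, so it is a noisy estimate. Averaging over more samples moves the estimate towards the exact value, at the cost of more reward evaluations.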

Due to sampling, both approaches can suffer from high variance, which can be combatted using control variates (see for example Chapter 9 of Ross 2013).
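As one illustration of a control variate, a constant baseline $b$ can be subtracted from the reward. The sketch below assumes the score function form of the gradient estimate, $-(\delta - b) \nabla_w \log \pi_w(y \mid x)$ , which is derived later in this post, and a softmax policy over three outputs with made-up probabilities and rewards. Because the toy output space is tiny, the mean and the total variance of the estimator are computed exactly by enumeration rather than by sampling.

```python
def grad_log_prob(probs, y):
    """For a softmax policy over logits: d/dw_k log pi_w(y) = 1[k == y] - pi_w(k)."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def estimator_mean_and_variance(probs, rewards, baseline):
    """Exact mean and total variance of g(y) = -(delta_y - b) * grad log pi_w(y)."""
    n = len(probs)
    mean = [0.0] * n
    second_moment = 0.0
    for y, p in enumerate(probs):
        g = [-(rewards[y] - baseline) * s for s in grad_log_prob(probs, y)]
        for k in range(n):
            mean[k] += p * g[k]
        second_moment += p * sum(gk * gk for gk in g)
    # Total variance: E[||g||^2] - ||E[g]||^2, summed over all components.
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

# Hypothetical toy setup: all rewards are high, which makes the effect visible.
probs = [0.3, 0.25, 0.45]
rewards = [0.8, 1.0, 0.9]
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = estimator_mean_and_variance(probs, rewards, 0.0)
mean_base, var_base = estimator_mean_and_variance(probs, rewards, expected_reward)
print(var_plain, var_base)
```

The baseline leaves the expected gradient unchanged, because $\mathbb{E}_{\pi_w(y \mid x)}[\nabla_w \log \pi_w(y \mid x)] = 0$ , but in this toy setup it reduces the variance considerably. Using the expected reward as the baseline is only one possible choice.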

The Problem: The Reward Function cannot be Differentiated

To minimize $\mathcal{L}_{\delta}$ with stochastic gradient descent, it is necessary to calculate $\nabla_w \mathcal{L}_{\delta}$ , also called the policy gradient in Reinforcement Learning (RL) terms.

But in practice, the rewards $\delta$ typically either come from an unknown function (e.g. if rewards are collected from human users) or from a function that cannot be differentiated (e.g. BLEU).

As such, it is not immediately clear how to differentiate $\mathcal{L}_{\delta}$ , i.e. how to calculate $\nabla_w\mathcal{L}_{\delta}.$

The Solution: Score Function Gradient Estimator

To be able to calculate $\nabla_w\mathcal{L}_{\delta}$ , we use two tricks:

1. The $\log$ Derivative Trick

The derivative of the logarithm is:

$\nabla_w \log f = \frac{\nabla_w f}{f}.$

2. The Identity Trick

$f = \frac{g}{g} f$

Now we can formulate what is known as the score function gradient estimator (Fu 2006):

$\begin{align} \nabla_w \mathcal{L}_\delta &= \nabla_w \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [- \delta] & (1) \\ &= \nabla_w \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (2) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w\pi_w(y \mid x)\textrm{d}y & (3) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \frac{\nabla_w \pi_w(y \mid x)}{\pi_w(y \mid x)} \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (4) \\ &= \int_{x} \int_{y} -\delta \, \cdot \, p(x)\textrm{d}x \, \cdot \, \nabla_w \log \pi_w(y \mid x) \, \cdot \, \pi_w(y \mid x)\textrm{d}y & (5) \\ &= \mathbb{E}_{p(x)} \mathbb{E}_{\pi_w(y \mid x)} [-\delta \nabla_w \log \pi_w(y \mid x)]. & (6) \end{align}$

Let’s investigate for each line what happened:

(2): The expectation is expanded into two integrals. $\mathbb{E}_{p(x)}$ becomes $\int_{x} \dots p(x)\textrm{d}x$ and $\mathbb{E}_{\pi_w(y \mid x)}$ turns into $\int_{y} \dots \pi_w(y \mid x)\textrm{d}y$ .
(3): Assuming the integral and the derivative can be interchanged, we move $\nabla_w$ directly in front of $\pi_w(y \mid x)$ , because $\pi_w(y \mid x)$ is the only term that depends on $w$ . Neither $p(x)$ nor $\delta$ does.
(4): We use the identity trick with $g = \pi_w(y \mid x)$ .
(5): We use the $\log$ derivative trick.
(6): We still have $\pi_w(y \mid x)\textrm{d}y$ available. With this, we can transform the expression back into an expectation. But in contrast to before, we now have $\nabla_w \log \pi_w(y \mid x)$ and this derivative is simply scaled by $\delta$ .

$\rightarrow$ We no longer need to know what the function that produces $\delta$ looks like, nor do we need to differentiate it.
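The result in line (6) can be checked numerically on a toy problem. The sketch below assumes a softmax policy over three outputs for a single input $x$ and a made-up reward table hidden behind `black_box_reward`. It compares the sampled score function estimate with a finite-difference gradient of the exact expected risk, which is only computable here because the output space is tiny.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def black_box_reward(y):
    """Stand-in for a metric we can only query, never differentiate."""
    return [0.0, 1.0, 0.5][y]

def grad_log_prob(probs, y):
    """For a softmax policy: d/dw_k log pi_w(y) = 1[k == y] - pi_w(k)."""
    return [(1.0 if k == y else 0.0) - p for k, p in enumerate(probs)]

def score_function_gradient(w, num_samples, rng):
    """Average of -delta * grad log pi_w(y) over samples y ~ pi_w."""
    probs = softmax(w)
    grad = [0.0] * len(w)
    for _ in range(num_samples):
        y = rng.choices(range(len(w)), weights=probs, k=1)[0]
        delta = black_box_reward(y)
        for k, g in enumerate(grad_log_prob(probs, y)):
            grad[k] += -delta * g / num_samples
    return grad

def expected_risk(w):
    """Exact E_{pi_w}[-delta], obtained by enumerating all outputs."""
    return -sum(p * black_box_reward(y) for y, p in enumerate(softmax(w)))

def finite_difference_gradient(w, eps=1e-5):
    """Central finite differences of the exact expected risk."""
    grad = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grad.append((expected_risk(up) - expected_risk(down)) / (2 * eps))
    return grad

w = [0.2, -0.1, 0.4]
estimate = score_function_gradient(w, 20000, random.Random(0))
reference = finite_difference_gradient(w)
print(estimate)
print(reference)
```

The estimator only ever calls `black_box_reward` on sampled outputs; the derivative it needs is that of $\log \pi_w(y \mid x)$ alone.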

For an alternative view on the subject, also see this great blog post.

When can it be applied?

The score function gradient estimator can be applied independently of the underlying model, as long as $\pi_w(y \mid x)$ is differentiable with respect to $w$ .

E.g. if $\pi_w(y \mid x)$ is a log-linear model with feature vectors $\phi(x,y)$ ,

$\pi_w(y \mid x) = \frac{e^{ w^\top \phi(x,y)}}{\sum_{y'\in \mathbf{Y}(x)} e^{ w^\top \phi(x, y')}},$

then the derivative would be

$\nabla_w \log \pi_w(y \mid x) = \phi(x,y) - \sum_{y'\in \mathbf{Y}(x)} \phi(x, y')\pi_w(y' \mid x).$
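The log-linear gradient can be verified against finite differences. The sketch below assumes a candidate set $\mathbf{Y}(x)$ of three outputs with made-up two-dimensional feature vectors. The function names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_linear_probs(w, features):
    """pi_w(y | x) proportional to exp(w . phi(x, y)) over the candidate set."""
    scores = [dot(w, phi) for phi in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(w, features, y):
    """phi(x, y) minus the expected feature vector under pi_w."""
    probs = log_linear_probs(w, features)
    dims = range(len(w))
    expected_phi = [sum(p * phi[d] for p, phi in zip(probs, features)) for d in dims]
    return [features[y][d] - expected_phi[d] for d in dims]

def finite_difference(w, features, y, eps=1e-6):
    """Central finite differences of log pi_w(y | x) with respect to w."""
    grad = []
    for d in range(len(w)):
        up, down = list(w), list(w)
        up[d] += eps
        down[d] -= eps
        diff = (math.log(log_linear_probs(up, features)[y])
                - math.log(log_linear_probs(down, features)[y]))
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical feature vectors phi(x, y) for three candidate outputs.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.3]

analytic = grad_log_prob(w, features, y=2)
numeric = finite_difference(w, features, y=2)
print(analytic)
print(numeric)
```

Both gradients agree up to numerical error, so for this model class the score $\nabla_w \log \pi_w(y \mid x)$ is available in closed form.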

In the case of neural networks, backpropagation is applied to compute $\nabla_w \log \pi_w(y \mid x)$ (see for example Chapter 3 of Cho 2015).

Lessons Learnt

MLE sometimes cannot be applied, or it leads to inferior performance.
Instead, we leverage rewards from an external metric that evaluates the quality of our model outputs.
The metric might be unknown or not differentiable, so (stochastic) gradient descent cannot be applied directly.
The score function gradient estimator helps us side-step this problem.

Acknowledgment: Thanks to Julia Kreutzer for her valuable and much needed feedback for improving this post.

Disclaimer: This blogpost reflects solely the opinion of the author, not that of any of her affiliated organizations, and makes no claims or warranties as to completeness, accuracy and up-to-dateness.

Comments, ideas and critical views are very welcome. We appreciate your feedback! If you want to cite this blogpost, use this bibfile.

Carolin Lawrence