Discrete-output Deep Deterministic Policy Gradient

Abstract: Many papers argue that exact backpropagation is not possible in NLP because sequences are composed of discrete elements; this problem is typically overcome with the REINFORCE approach. For continuous action spaces, backpropagating directly from the discriminator to the generator (GANs) or from the critic to the actor (deep deterministic policy gradient) has been shown to work very well. This work formulates an algorithm that combines the deep deterministic policy gradient with recent work showing how to backpropagate through the stochastic node introduced by sampling discrete actions. Results and remaining problems on a toy task (CartPole) are presented and discussed.
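A minimal sketch of the kind of actor update this combination implies, assuming a Gumbel-Softmax (straight-through) relaxation as the reparameterization of the discrete sampling node and a simple PyTorch actor/critic pair; the network sizes, hyperparameters, and the specific relaxation are illustrative assumptions, not the exact implementation described here:

```python
# Hypothetical sketch: backpropagating the critic's gradient into a
# discrete-action actor through a relaxed (Gumbel-Softmax) sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 4, 2  # CartPole-sized toy dimensions (assumed)

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim + n_actions, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)                      # dummy batch of states
logits = actor(obs)                                 # unnormalized action scores
# hard=True yields a discrete one-hot action in the forward pass while
# gradients flow through the soft sample (straight-through style).
action = F.gumbel_softmax(logits, tau=1.0, hard=True)
q_value = critic(torch.cat([obs, action], dim=-1))  # critic scores the state-action pair

# DDPG-style actor update: ascend the critic's value estimate.
actor_loss = -q_value.mean()
actor_opt.zero_grad()
actor_loss.backward()                               # gradient reaches the actor through the sampling node
actor_opt.step()
```

In this sketch the actor receives gradients directly from the critic, as in the continuous-action deep deterministic policy gradient, with the relaxation standing in for the otherwise non-differentiable discrete sampling step that would normally force a REINFORCE-style estimator.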