Bandit structured prediction describes the task of online learning from partial information, in the form of bandit feedback on predicted structures. We approach this problem with stochastic zeroth-order (SZO) methods that approximate gradients by function evaluations under perturbations in parameter space. In contrast to stochastic first-order optimization, which works directly with gradients of differentiable functions, this allows us to exploit non-differentiable criteria such as maximum-a-posteriori (MAP) inference at training and test time. We also investigate ways to mitigate the poor convergence properties of SZO methods. There is preliminary evidence that if random perturbations in large, sparse feature spaces are restricted to features that are active for particular inputs and their corresponding output spaces, the dimensionality affecting iteration complexity is reduced to the maximal number of active features.
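To make the idea concrete, the following is a minimal sketch of a two-point SZO gradient estimate with perturbations restricted to active features. The function names (`szo_gradient_estimate`, `bandit_loss`), the index set `active`, and all constants are hypothetical illustrations, not the paper's actual implementation; the sketch assumes bandit feedback is available as a scalar loss evaluation at a perturbed parameter vector.

```python
import numpy as np

def szo_gradient_estimate(loss, w, active_idx, mu=1e-2, rng=None):
    """Two-point stochastic zeroth-order gradient estimate.

    Perturbs only the coordinates in `active_idx` (features active for the
    current input and its output space), so the effective dimension is the
    number of active features rather than the full parameter dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = np.zeros_like(w)
    u[active_idx] = rng.standard_normal(len(active_idx))  # sparse perturbation
    # Finite-difference estimate of the directional derivative along u,
    # using only loss evaluations -- no analytic gradient is required, so the
    # loss may be non-differentiable (e.g., a task metric under MAP inference).
    return (loss(w + mu * u) - loss(w)) / mu * u

# Usage sketch: one SZO update step on a stand-in bandit loss.
def bandit_loss(w):
    # Hypothetical scalar feedback on the structure predicted under w.
    return float(np.maximum(0.0, 1.0 - w.sum()))

w = np.zeros(10_000)                # large, sparse parameter space
active = np.array([3, 17, 4077])    # features active for this input
g = szo_gradient_estimate(bandit_loss, w, active)
w -= 0.1 * g                        # SGD-style update with the SZO estimate
```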