Dueling Bandits and Other Partial-Feedback Learning Problems

In supervised learning, we usually assume that for each instance we have access to a correct prediction to compare against. This "full-feedback" setting allows us to evaluate any prediction or action we may want to consider. In several real-world applications, however, the feedback is more restricted. In on-line advertising, for instance, we can only evaluate the ad we actually displayed on a web page. This constraint leads to a trade-off between "exploration" (a new ad needs to be displayed in order for us to learn its click-through rate) and "exploitation" (displaying the ad with the current best estimated click-through rate seems better in the short term). This is what we call "bandit feedback", by analogy with a gambler facing several unknown slot machines and wagering on the most rewarding ones. Another interesting example of partial feedback is ranked prediction, where only a short list of items is proposed for evaluation: the only feedback we receive is a preference among the proposed items. Shall we only propose items that we already consider relevant, or shall we also explore apparently irrelevant ones? Dueling Bandits and Cascading Bandits algorithms were recently proposed to deal with this problem. I will first survey the different aspects of on-line learning with partial feedback before focusing on ranked prediction and Dueling Bandits.
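To make the exploration/exploitation trade-off under bandit feedback concrete, here is a minimal epsilon-greedy sketch in Python. It is an illustration, not an algorithm from the survey: the names true_ctrs, epsilon, and horizon are hypothetical, and the hidden click-through rates are used only to simulate user clicks.

import random

def epsilon_greedy(true_ctrs, epsilon=0.1, horizon=10_000):
    # Ad selection under bandit feedback: true_ctrs (the ads' hidden
    # click-through rates) is unknown to the learner and is used only
    # to simulate whether a displayed ad gets clicked.
    n_ads = len(true_ctrs)
    clicks = [0] * n_ads   # observed clicks per ad
    shows = [0] * n_ads    # times each ad was displayed

    total_reward = 0
    for _ in range(horizon):
        if random.random() < epsilon:
            # Exploration: display a uniformly random ad.
            ad = random.randrange(n_ads)
        else:
            # Exploitation: display the ad with the best empirical CTR
            # (unseen ads get an optimistic score so they are tried once).
            ad = max(range(n_ads),
                     key=lambda a: clicks[a] / shows[a] if shows[a] else float("inf"))
        # Bandit feedback: we observe a click (or not) only for the ad displayed.
        reward = 1 if random.random() < true_ctrs[ad] else 0
        shows[ad] += 1
        clicks[ad] += reward
        total_reward += reward
    return total_reward

# Example: three ads with hidden click-through rates.
print(epsilon_greedy([0.02, 0.05, 0.03]))

With probability epsilon the learner displays a random ad to keep learning the click-through rates; otherwise it displays the empirically best one, trading short-term reward for information.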
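The dueling-bandit feedback model can be sketched the same way. The following is a naive epsilon-greedy loop over pairwise duels, shown only to illustrate the protocol (the learner sees which of two items won, nothing else); it is not one of the Dueling Bandits algorithms from the literature, and the names pref, duel, and score are hypothetical.

import random

def duel(i, j, pref):
    # Simulate one duel: True if item i beats item j.
    # pref[i][j] is the hidden probability that i wins against j.
    return random.random() < pref[i][j]

def dueling_bandit(pref, epsilon=0.1, horizon=5_000):
    n = len(pref)
    wins = [[0] * n for _ in range(n)]   # wins[i][j]: times i beat j
    plays = [[0] * n for _ in range(n)]  # plays[i][j]: duels between i and j

    def score(i):
        # Empirical average win rate of item i over the items it has faced
        # (unseen items get an optimistic score so they are tried).
        rates = [wins[i][j] / plays[i][j] for j in range(n) if plays[i][j] > 0]
        return sum(rates) / len(rates) if rates else float("inf")

    for _ in range(horizon):
        if random.random() < epsilon:
            i, j = random.sample(range(n), 2)            # explore a random pair
        else:
            i = max(range(n), key=score)                 # exploit current best
            j = random.choice([k for k in range(n) if k != i])
        # Preference feedback only: we learn which item won the duel.
        if duel(i, j, pref):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
        plays[i][j] += 1
        plays[j][i] += 1
    return max(range(n), key=score)

# Hidden preference matrix: item 1 beats both other items.
pref = [[0.5, 0.3, 0.6],
        [0.7, 0.5, 0.8],
        [0.4, 0.2, 0.5]]
print(dueling_bandit(pref))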