Multi-armed bandit problem with unconventional feedback

The classical multi-armed bandit (MAB) problem is a formulation of the exploration-exploitation dilemma inherent to reinforcement learning. In this problem, the learner has access to a number of available actions (symbolized by arms) and she has to repeatedly select one of them (symbolized by pulling an arm) over a period of time. The goal of the learner is to maximize her cumulative reward, which is equivalent to minimizing her cumulative regret. The learner obtains information about the actions through the available feedback, which is conventionally assumed to be equal to the reward. However, this assumption does not hold in several practical scenarios. In this talk, we shall see two kinds of unconventional feedback for the MAB problem.

The first kind of feedback we consider is dueling feedback, which is motivated by information retrieval from implicit user feedback. In this setting, the learner pulls two arms at each time period instead of one; she receives the average of their rewards but observes only the outcome of the duel between the selected arms. We provide a lower bound on the expected regret of any bandit algorithm in this setting. We then devise an algorithm for this problem, the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3), and prove an upper bound on its regret.

The second kind of feedback we consider is corrupted feedback, which is motivated by privacy preservation in online recommender systems. In this setting, the learner observes only a transformation of the rewards through a stochastic corruption process with known parameters, while the goal remains to maximize the sum of the (unobserved) true rewards. We provide a lower bound on the expected regret of any bandit algorithm in this setting. We devise a frequentist algorithm, KLUCB-CF, and a Bayesian algorithm, TS-CF, and give upper bounds on their regret. We also provide the corruption parameters that guarantee a desired level of differential privacy and analyze how this impacts the regret.

The talk covers all the aforementioned algorithms, brief analyses of them, and experiments that corroborate the theoretical results.
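
To make the dueling feedback protocol described above concrete, here is a minimal sketch assuming stochastic Bernoulli rewards. The environment class, its parameters, and the uniform-random learner are illustrative stand-ins: the sketch shows only the feedback structure (the duel outcome is observed, the average reward is not), not the REX3 policy itself.

    import numpy as np

    # Minimal sketch of the dueling-feedback protocol, assuming Bernoulli rewards;
    # names and parameters are illustrative, not taken from the talk.
    class DuelingBanditEnv:
        def __init__(self, means, rng=None):
            self.means = np.asarray(means)          # expected reward of each arm
            self.rng = rng or np.random.default_rng(0)

        def duel(self, a, b):
            """Pull arms a and b; the learner only ever sees the duel outcome.

            The average of the two rewards is 'received' but never observed;
            it only enters the (hidden) regret computation.
            """
            r_a = self.rng.random() < self.means[a]   # Bernoulli reward of arm a
            r_b = self.rng.random() < self.means[b]   # Bernoulli reward of arm b
            outcome = int(r_a) - int(r_b)             # +1 if a wins, -1 if b wins, 0 on tie
            hidden_gain = (int(r_a) + int(r_b)) / 2.0 # unobserved average reward
            return outcome, hidden_gain

    env = DuelingBanditEnv(means=[0.9, 0.5, 0.4])
    T, regret = 10_000, 0.0
    best = env.means.max()
    for _ in range(T):
        a, b = env.rng.integers(3), env.rng.integers(3)  # placeholder for a real policy such as REX3
        outcome, hidden_gain = env.duel(a, b)            # a learner would use only `outcome`
        regret += best - (env.means[a] + env.means[b]) / 2.0  # expected regret of the chosen pair
    print(f"cumulative regret of uniform play over {T} rounds: {regret:.1f}")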
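
Similarly, the following is a minimal sketch of the corrupted feedback setting, assuming Bernoulli rewards passed through a randomized-response corruption with a known parameter. The debiasing step is a generic unbiased estimator, not the KLUCB-CF or TS-CF index computation, and the stated privacy level is the standard randomized-response guarantee rather than the specific parameters derived in the talk.

    import numpy as np

    # Minimal sketch of corrupted feedback via randomized response (illustrative
    # assumption); the learner only ever sees the corrupted observations.
    rng = np.random.default_rng(1)
    mu = 0.7          # true (unobserved) mean reward of one arm
    p  = 0.8          # known probability that the true reward bit is reported unchanged

    T = 100_000
    true_rewards = rng.random(T) < mu                        # hidden Bernoulli rewards
    keep = rng.random(T) < p                                 # randomized-response coin
    observed = np.where(keep, true_rewards, ~true_rewards)   # what the learner actually sees

    # Known corruption parameters let the learner debias the observed mean:
    # E[observed] = p*mu + (1-p)*(1-mu)  =>  mu = (mean(observed) - (1-p)) / (2p - 1)
    mu_hat = (observed.mean() - (1 - p)) / (2 * p - 1)
    print(f"debiased estimate of mu: {mu_hat:.3f} (true value {mu})")

    # Randomized response with keep-probability p is ln(p/(1-p))-differentially private,
    # so the corruption level can be chosen from a target privacy budget epsilon.
    epsilon = np.log(p / (1 - p))
    print(f"per-observation differential privacy level: epsilon = {epsilon:.3f}")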