Ruprecht-Karls-Universität Heidelberg
Bilder vom Neuenheimer Feld, Heidelberg und der Universität Heidelberg

Online Learning under Weak (Bandit) Information


Studiengang Modulkürzel Leistungs-
BA-2010 AS-CL 8 LP
Master SS-CL, SS-TAC 8 LP
Magister - -
Dozenten/-innen Artem Sokolov, Stefan Riezler
Veranstaltungsart Hauptseminar
Erster Termin 22.10.2015
Zeit und Ort Do, 09:1510:45, INF 325 / SR 23 (SR)
Commitment-Frist 20.01.2016


  • Aktive und regelmässige Teilnahme
  • 2 Referate/Vorträge


The goal of online learning is to provide predictions for unseen examples in a sequential fashion, without possibility to pass multiple times over offline training data. The course focuses on online algorithms that provide optimal answers even without full knowledge of the correct labels of the training examples. User cases include web ad placements (the identity of most profitable ads is not known until many are tried), partial feedbacks in crowd-sourcing environments (e.g., a grade how good is some output sentence without post-editing it), etc. Part of the course includes learning in non-cooperative environments, where the inputs and the feedbacks are not assumed to follow any distribution.



Datum Sitzung Referent
22.10 Organisation slides
29.10 Einführungsvortrag slides
05.11 Hedge: Y. Freund, R. Schapire "A decision-theoretic generalization of on-line learning and an application to boosting" ECOLT, 1995
EXP3/4: P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32: 48- 77, 2002.
Artem Sokolov
annotated pdf
12.11 P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine learning, 47 2/3:235 – 256, 2002. Artem Sokolov
19.11 J. Langford, T. Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS 2007 Juri Opitz (presentation)
26.11 L. Li et al. A contextual-bandit approach to personalized news article recommendation. WWW. 2010 analysis Carine Dengler (reading group)
03.12 B. Polyak, Ya. Tsypkin. Pseudogradient adaptation and training algorithms. Avtomat. i Telemekh., Issue 3, Pages 45–68. 1973
A. Sokolov, T. Urvoy, S. Riezler. Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation. MT Summit. 2015
Julian Hitschler (presentation)
10.12 R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, NIPS, 2000.
S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions. ACL 2009. slides
Arthur Neidlein (presentation)
Julia Kreutzer (reading group)
17.12 Watkins, C. J. & Dayan, P. Q-learning Machine Learning 8, 279–292. 1992.
Suppl.: RL slides by Csaba Szepesvari 1, 2, 3
V. Mnih et al. Human-level Control through Deep Reinforcement Learning. Nature. 2015
Artem Sokolov (reading group)
Zoe Bylinovich (presentation)
24.12 keine Sitzung - Weihnachten
31.12 keine Sitzung - Silvester
07.01 2012, E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite time analysis. ICALT, 2012 slides
O. Chapelle, L. Li. An empirical evaluation of Thompson sampling. NIPS. 2010
Carine Dengler (reading group)
Juri Opitz (reading group)
14.01 M. Dudik et al. Efficient optimal learning for contextual bandits. UAI. 2011 poster
A. Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML. 2014 slides
Julia Kreuzter (presentation)
Julian Hitschler (reading group)
21.01 V. Dani, S. Kakade, T. Hayes, The Price of Bandit Information for Online Optimization NIPS 2007
N. Cesa-Bianchi, G. Lugosi. Combinatorial bandits. Journal of Computer and Systems Sciences, 78:1404-1422, 2012. COLT version
Arthur Neidlein (reading group)
Artem Sokolov (reading group)
28.01 L. Li et al. "Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study", WWW'15
A. Swaminathan, T. Joachims "Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization", JMRL'15
Lauritz Brandt (reading group)
Artem Sokolov (reading group)
04.02 S. Bubeck, A. Slivkins. The Best of Both Worlds: Stochastic and Adversarial Bandits. JMLR. 2012
Y. Seldin, A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. ICML, 2014.
Zoe Bylinovich (reading group)
Lauritz Brandt (presentation)


zum Seitenanfang