
Online Learning under Weak (Bandit) Information
Course Description
Degree program | Module code | Credits |
---|---|---|
BA-2010 | AS-CL | 8 LP |
NBA | AS-CL | 8 LP |
Master | SS-CL, SS-TAC | 8 LP |
Magister | - | - |
Lecturers | Artem Sokolov, Stefan Riezler |
Course type | Advanced seminar (Hauptseminar) |
First session | 22.10.2015 |
Time and place | Thu, 09:15–10:45, INF 325 / SR 23 |
Commitment deadline | 20.01.2016 |
Course Requirements
- Active and regular participation
- Two presentations
Content
The goal of online learning is to provide predictions for unseen examples in a sequential fashion, without the possibility of passing multiple times over offline training data. The course focuses on online algorithms that provide optimal answers even without full knowledge of the correct labels of the training examples. Use cases include web ad placement (the most profitable ads are not known until many have been tried) and partial feedback in crowd-sourcing environments (e.g., a grade for how good an output sentence is, without post-editing it). Part of the course covers learning in non-cooperative environments, where the inputs and the feedback are not assumed to follow any distribution.
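The EXP3 algorithm covered in the seminar is a concrete instance of learning under such weak, non-cooperative feedback: only the reward of the chosen action is observed, and no distributional assumptions are made. Below is a minimal illustrative sketch (the function names, the exploration rate, and the toy reward function are assumptions for demonstration, not part of the seminar's materials):

```python
import math
import random

def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """EXP3 for adversarial bandits: each round, only the pulled
    arm's reward (assumed to lie in [0, 1]) is observed."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(n_rounds):
        w_sum = sum(weights)
        # mix exponential weights with uniform exploration
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(t, arm)  # bandit feedback: one reward, no labels
        total += r
        # importance-weighted estimate compensates for partial observation
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / n_arms)
    return total

# toy environment: arm 1 pays 0.9, the others 0.1
avg = exp3(lambda t, a: 0.9 if a == 1 else 0.1,
           n_arms=3, n_rounds=2000) / 2000
print(avg)  # average reward, well above the ~0.37 uniform-play baseline
```

The importance-weighted update is what makes the algorithm work with bandit rather than full feedback: the unobserved rewards are implicitly estimated as zero, and dividing by the sampling probability keeps the estimate unbiased.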
Course Overview
Seminar Schedule
Date | Session | Presenter |
---|---|---|
22.10 | Organization | slides |
29.10 | Introductory lecture | slides |
05.11 | Hedge: Y. Freund, R. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". EuroCOLT, 1995. EXP3/4: P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48–77, 2002. | Artem Sokolov slides annotated pdf |
12.11 | P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002. | Artem Sokolov slides |
19.11 | J. Langford, T. Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS 2007 | Juri Opitz (presentation) |
26.11 | L. Li et al. A contextual-bandit approach to personalized news article recommendation. WWW. 2010 analysis | Carine Dengler (reading group) |
03.12 | B. Polyak, Ya. Tsypkin. Pseudogradient adaptation and training algorithms. Avtomat. i Telemekh., Issue 3, Pages 45–68. 1973 A. Sokolov, T. Urvoy, S. Riezler. Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation. MT Summit. 2015 | Julian Hitschler (presentation) |
10.12 | R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, NIPS, 2000. S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions. ACL 2009. slides | Arthur Neidlein (presentation) Julia Kreutzer (reading group) |
17.12 | C. J. Watkins, P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992. Suppl.: RL slides by Csaba Szepesvari 1, 2, 3. V. Mnih et al. Human-level Control through Deep Reinforcement Learning. Nature. 2015 | Artem Sokolov (reading group) Zoe Bylinovich (presentation) |
24.12 | no session - Christmas | |
31.12 | no session - New Year's Eve | |
07.01 | E. Kaufmann, N. Korda, R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. ALT, 2012 slides O. Chapelle, L. Li. An empirical evaluation of Thompson sampling. NIPS. 2011 | Carine Dengler (reading group) Juri Opitz (reading group) |
14.01 | M. Dudik et al. Efficient optimal learning for contextual bandits. UAI. 2011 poster A. Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML. 2014 slides | Julia Kreutzer (presentation) Julian Hitschler (reading group) |
21.01 | V. Dani, S. Kakade, T. Hayes. The Price of Bandit Information for Online Optimization. NIPS 2007. N. Cesa-Bianchi, G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012. COLT version | Arthur Neidlein (reading group) Artem Sokolov (reading group) |
28.01 | L. Li et al. "Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study". WWW'15. A. Swaminathan, T. Joachims. "Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization". JMLR'15 | Lauritz Brandt (reading group) Artem Sokolov (reading group) |
04.02 | S. Bubeck, A. Slivkins. The Best of Both Worlds: Stochastic and Adversarial Bandits. JMLR. 2012 Y. Seldin, A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. ICML, 2014. | Zoe Bylinovich (reading group) Lauritz Brandt (presentation) |
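As a companion to the Auer et al. finite-time analysis reading above, here is a minimal UCB1 sketch for the stochastic setting (the Bernoulli arm means, horizon, and function names are toy assumptions chosen for illustration):

```python
import math
import random

def ucb1(means, n_rounds, seed=0):
    """UCB1: pull the arm maximizing the empirical mean plus the
    confidence bonus sqrt(2 ln t / n_pulls); rewards are Bernoulli."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n    # pulls per arm
    sums = [0.0] * n    # cumulative reward per arm
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= n:
            arm = t - 1  # initialize: pull each arm once
        else:
            arm = max(range(n),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if rng.random() < means[arm] else 0.0  # bandit feedback
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

total, counts = ucb1([0.2, 0.5, 0.8], n_rounds=5000)
print(counts)  # pulls concentrate on the best (0.8) arm
```

Unlike EXP3, UCB1 assumes rewards are drawn i.i.d. per arm; the confidence bonus shrinks as an arm is pulled more often, so suboptimal arms are eventually pulled only logarithmically often.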
Course Materials
- S. Shalev-Shwartz. Online Learning and Online Convex Optimization. 2011
- S. Bubeck, N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. 2012
- L. Zhou. A Survey on Contextual Multi-armed Bandits. 2015
- C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool. 2010 (free pdf)