
Online Learning under Weak (Bandit) Information
Course Description
Degree program | Module code | Credits |
---|---|---|
BA-2010 | AS-CL | 8 LP |
NBA | AS-CL | 8 LP |
Master | SS-CL, SS-TAC | 8 LP |
Magister | - | - |
Lecturers | Artem Sokolov, Stefan Riezler |
Course type | Advanced seminar (Hauptseminar) |
First session | 22.10.2015 |
Time and place | Thu, 09:15–10:45, INF 325 / SR 23 |
Commitment deadline | 20.01.2016 |
Course Requirements
- Active and regular participation
- Two presentations
Content
The goal of online learning is to provide predictions for unseen examples in a sequential fashion, without the possibility of passing multiple times over offline training data. The course focuses on online algorithms that provide optimal answers even without full knowledge of the correct labels of the training examples. Use cases include web ad placement (the most profitable ads are not known until many have been tried) and partial feedback in crowd-sourcing environments (e.g., a grade for how good an output sentence is, without post-editing it). Part of the course covers learning in non-cooperative environments, where the inputs and the feedback are not assumed to follow any distribution.
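The EXP3 algorithm covered in the seminar is a concrete instance of learning under such weak, non-cooperative feedback: only the reward of the chosen action is observed, and no distributional assumptions are made. Below is a minimal illustrative sketch (the function names, the exploration rate, and the toy reward function are assumptions for demonstration, not part of the seminar's materials):

```python
import math
import random

def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """EXP3 for adversarial bandits: each round, only the pulled
    arm's reward (assumed to lie in [0, 1]) is observed."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(n_rounds):
        w_sum = sum(weights)
        # mix exponential weights with uniform exploration
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(t, arm)  # bandit feedback: one reward, no labels
        total += r
        # importance-weighted estimate compensates for partial observation
        weights[arm] *= math.exp(gamma * (r / probs[arm]) / n_arms)
    return total

# toy environment: arm 1 pays 0.9, the others 0.1
avg = exp3(lambda t, a: 0.9 if a == 1 else 0.1,
           n_arms=3, n_rounds=2000) / 2000
print(avg)  # average reward, well above the ~0.37 uniform-play baseline
```

The importance-weighted update is what makes the algorithm work with bandit rather than full feedback: the unobserved rewards are implicitly estimated as zero, and dividing by the sampling probability keeps the estimate unbiased.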
Course Overview
Seminar Schedule
Date | Session | Presenter |
---|---|---|
22.10 | Organization | slides |
29.10 | Introductory lecture | slides |
05.11 | Hedge: Y. Freund, R. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". EuroCOLT, 1995. EXP3/4: P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48–77, 2002. | Artem Sokolov slides annotated pdf |
12.11 | P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002. | Artem Sokolov slides |
19.11 | J. Langford, T. Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS 2007 | Juri Opitz (presentation) |
26.11 | L. Li et al. A contextual-bandit approach to personalized news article recommendation. WWW. 2010 analysis | Carine Dengler (reading group) |
03.12 | B. Polyak, Ya. Tsypkin. Pseudogradient adaptation and training algorithms. Avtomat. i Telemekh., Issue 3, Pages 45–68. 1973 A. Sokolov, T. Urvoy, S. Riezler. Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation. MT Summit. 2015 | Julian Hitschler (presentation) |
10.12 | R. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, NIPS, 2000. S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions. ACL 2009. slides | Arthur Neidlein (presentation) Julia Kreutzer (reading group) |
17.12 | C. J. Watkins, P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992. Suppl.: RL slides by Csaba Szepesvari 1, 2, 3. V. Mnih et al. Human-level Control through Deep Reinforcement Learning. Nature. 2015 | Artem Sokolov (reading group) Zoe Bylinovich (presentation) |
24.12 | no session - Christmas | |
31.12 | no session - New Year's Eve | |
07.01 | E. Kaufmann, N. Korda, R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. ALT, 2012 slides O. Chapelle, L. Li. An empirical evaluation of Thompson sampling. NIPS. 2011 | Carine Dengler (reading group) Juri Opitz (reading group) |
14.01 | M. Dudik et al. Efficient optimal learning for contextual bandits. UAI. 2011 poster A. Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML. 2014 slides | Julia Kreutzer (presentation) Julian Hitschler (reading group) |
21.01 | V. Dani, S. Kakade, T. Hayes. The Price of Bandit Information for Online Optimization. NIPS 2007. N. Cesa-Bianchi, G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012. COLT version | Arthur Neidlein (reading group) Artem Sokolov (reading group) |
28.01 | L. Li et al. "Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study". WWW'15. A. Swaminathan, T. Joachims. "Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization". JMLR'15 | Lauritz Brandt (reading group) Artem Sokolov (reading group) |
04.02 | S. Bubeck, A. Slivkins. The Best of Both Worlds: Stochastic and Adversarial Bandits. JMLR. 2012 Y. Seldin, A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. ICML, 2014. | Zoe Bylinovich (reading group) Lauritz Brandt (presentation) |
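As a companion to the Auer et al. finite-time analysis reading above, here is a minimal UCB1 sketch for the stochastic setting (the Bernoulli arm means, horizon, and function names are toy assumptions chosen for illustration):

```python
import math
import random

def ucb1(means, n_rounds, seed=0):
    """UCB1: pull the arm maximizing the empirical mean plus the
    confidence bonus sqrt(2 ln t / n_pulls); rewards are Bernoulli."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n    # pulls per arm
    sums = [0.0] * n    # cumulative reward per arm
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= n:
            arm = t - 1  # initialize: pull each arm once
        else:
            arm = max(range(n),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = 1.0 if rng.random() < means[arm] else 0.0  # bandit feedback
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

total, counts = ucb1([0.2, 0.5, 0.8], n_rounds=5000)
print(counts)  # pulls concentrate on the best (0.8) arm
```

Unlike EXP3, UCB1 assumes rewards are drawn i.i.d. per arm; the confidence bonus shrinks as an arm is pulled more often, so suboptimal arms are eventually pulled only logarithmically often.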
Course Materials
- S. Shalev-Shwartz. Online Learning and Online Convex Optimization. 2011
- S. Bubeck, N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. 2012
- L. Zhou. A Survey on Contextual Multi-armed Bandits. 2015
- C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool. 2010 (free pdf)