Reinforcement Learning (Spring 2024)

Reinforcement learning (RL) is a powerful learning paradigm through which machines learn to make (sequential) decisions. It has been playing a pivotal role in advancing artificial intelligence, with notable successes including mastering the game of Go and enhancing large language models.

This course focuses on the design principles of RL algorithms. As in statistical learning, a central challenge in RL is to generalize learned capabilities to unseen environments. However, RL also faces additional challenges, such as the exploration-exploitation tradeoff, credit assignment, and distribution mismatch between behavior and target policies. Throughout the course, we will study various solutions to these challenges and provide theoretical justifications for them.
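To make the exploration-exploitation tradeoff mentioned above concrete, here is a minimal sketch of epsilon-greedy on a Bernoulli bandit. It is not taken from the course materials; the function name `epsilon_greedy` and its parameters (`arm_means`, `horizon`, `epsilon`, `seed`) are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(arm_means, horizon, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit (illustrative sketch).

    arm_means: hypothetical success probability of each arm.
    Returns (pull counts, empirical mean estimates, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0

    for _ in range(horizon):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest empirical mean.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a Bernoulli reward from the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0

        # Incremental update of the empirical mean.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


if __name__ == "__main__":
    counts, estimates, total = epsilon_greedy([0.2, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("estimated means:", estimates)
    print("total reward:", total)
```

With `epsilon=0` the learner never explores and can lock onto a poor arm; with `epsilon=1` it never exploits what it has learned. Intermediate values trade one off against the other.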


This course is mathematically demanding. Students are expected to have strong foundations in probability, linear algebra, and calculus. A basic understanding of machine learning and convex optimization will be beneficial. Proficiency in Python programming is required.



Discussions: Piazza
Homework submissions: Gradescope


Date | Topics | Slides and Recommended Reading | Notes
1/17 | Introduction | Slides |
1/22 | Multi-armed bandits: explore-then-commit, epsilon-greedy, Boltzmann exploration, UCB, Thompson sampling | Slides; Ch. 2 of FR; Ch. 6, 7, 8, 36 of LS |
1/24 | Linear contextual bandits: LinUCB, linear Thompson sampling | Slides; Ch. 3 of FR; Ch. 18, 19, 20 of LS; Shipra Agrawal’s talk |
1/29 | General contextual bandits: UCB for logistic bandits, RegCB, SquareCB | Slides; Ch. 3 of FR; Dylan Foster’s talk |
2/5 | Adversarial online learning: exponential weight algorithm, projected gradient descent | Slides; Ch. 28 of LS; 5.5-5.11 of Constantine Caramanis’s channel |
2/7 | Adversarial multi-armed bandits: Exp3 | Slides; Haipeng Luo’s talk |
2/12 | Adversarial linear bandits: one-point gradient estimator + projected gradient descent, doubly robust estimator | Slides; Ch. 5, 6 of L |
2/19 | Basics of Markov decision processes: Bellman (optimality) equations, reverse Bellman equations, value iteration, (modified) policy iteration, performance difference lemma | Slides; Ch. 1.1-1.3 of AJKS; Ch. 3 of SB |
3/11 | Approximate value iteration: least-squares value iteration (LSVI), Watkins’s Q-learning, deep Q-learning, prioritized replay, double Q-learning | Slides; Ch. 3, 7 of AJKS; Lec. 7, 8 of Sergey Levine’s course |
3/18 | Policy evaluation: least-squares policy evaluation (LSPE), temporal difference (TD) learning, Monte Carlo estimation, TD(λ) | Slides; Ch. 5.1, 5.2, 5.5, 6.1-6.3, 9.1-9.4, 11.1-11.3, 12.1-12.5 of SB |
3/25 | Policy-based learning methods: least-squares policy iteration (LSPI), policy gradient, natural policy gradient (NPG) | Slides; Notes of J on 3/24-3/31; Lec. 5, 6, 9 of Sergey Levine’s course; W. van Heeswijk’s paper |
4/3 | Actor-critic methods: advantage actor-critic (A2C), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin-delayed DDPG (TD3), soft actor-critic (SAC) | Slides; Algorithms Docs in Spinning Up; References in the slides | HW3 due on 4/7
4/29 | Summary | Slides |
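
As a small illustration of the Bellman optimality equation and value iteration listed under the 2/19 topic, the sketch below runs tabular value iteration on a toy two-state MDP. It is not drawn from the course slides; the function `value_iteration`, the data layout (`P`, `R`), and the toy MDP itself are assumptions made up for this example.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration (illustrative sketch).

    P[s][a]: list of (next_state, probability) pairs (hypothetical layout).
    R[s][a]: expected immediate reward for taking action a in state s.
    Returns (approximate optimal values, greedy policy).
    """
    n_states = len(P)

    def q_value(V, s, a):
        # One-step lookahead: r(s, a) + gamma * E[V(s')].
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

    V = [0.0] * n_states
    for _ in range(max_iters):
        # Bellman optimality backup applied to every state.
        new_V = [
            max(q_value(V, s, a) for a in range(len(P[s])))
            for s in range(n_states)
        ]
        change = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if change < tol:
            break

    # Greedy policy with respect to the final value estimate.
    policy = [
        max(range(len(P[s])), key=lambda a: q_value(V, s, a))
        for s in range(n_states)
    ]
    return V, policy


# Toy MDP (made up): staying in state 1 pays reward 1; everything else pays 0.
P = [
    [[(0, 1.0)], [(1, 1.0)]],  # state 0: action 0 stays, action 1 moves to state 1
    [[(1, 1.0)], [(0, 1.0)]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],
]

if __name__ == "__main__":
    V, policy = value_iteration(P, R, gamma=0.9)
    print("values:", V)
    print("greedy policy:", policy)
```

For this toy MDP the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and moving there from state 0 is worth 0.9 × 10 = 9.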

