Course Information
- Instructor: Chen-Yu Wei
- TA: Haolin Liu (srs8rh at virginia.edu)
- Time: MW 9:30-10:45
- Location: Rice Hall 340
- Office Hours (Instructor): Th 15:30-16:30 at Rice 409
- Office Hours (TA): M 11:00-12:00 at Rice 336
Overview
Reinforcement learning (RL) is a powerful learning paradigm through which machines learn to make (sequential) decisions. It has been playing a pivotal role in advancing artificial intelligence, with notable successes including mastering the game of Go and enhancing large language models.
This course focuses on the design principles of RL algorithms. As in statistical learning, a central challenge in RL is to generalize learned capabilities to unseen environments. However, RL also faces additional challenges, such as the exploration-exploitation tradeoff, credit assignment, and the distribution mismatch between behavior and target policies. Throughout the course, we will delve into various solutions to these challenges and provide theoretical justifications.
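To give a concrete taste of the exploration-exploitation tradeoff mentioned above, here is a minimal sketch of epsilon-greedy on a Bernoulli multi-armed bandit (one of the 1/22 topics). The environment, function name, and parameters are illustrative, not part of the course materials:

```python
import random

def epsilon_greedy(arm_means, horizon=10000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit: explore uniformly w.p. epsilon, else exploit."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k       # number of pulls per arm
    estimates = [0.0] * k  # empirical mean reward per arm
    total_reward = 0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = 1 if rng.random() < arm_means[arm] else 0
        counts[arm] += 1
        # Incremental update of the empirical mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, estimates

reward, est = epsilon_greedy([0.2, 0.5, 0.8])
```

Setting epsilon too low risks never discovering the best arm; setting it too high wastes pulls on bad arms — exactly the tradeoff that UCB and Thompson sampling address more adaptively.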
Prerequisites
This course is mathematically demanding. Students are expected to have strong foundations in probability, linear algebra, and calculus. A basic understanding of machine learning and convex optimization will be beneficial. Proficiency in Python programming is required.
Grading
- (60%) Assignments: 4 problem sets, each consisting of theoretical questions and programming tasks.
- (35%) Final project: See here for the specification.
- (5%) Class participation
Platforms
Discussions: Piazza
Homework submissions: Gradescope
Schedule
Date | Topics | Slides and Recommended Reading | Notes |
---|---|---|---|
1/17 | Introduction | Slides | |
1/22 | Multi-armed bandits: explore-then-commit, epsilon-greedy, Boltzmann exploration, UCB, Thompson sampling | Slides; Ch. 2 of FR; Ch. 6, 7, 8, 36 of LS | |
1/24 | Linear contextual bandits: LinUCB, linear Thompson sampling | Slides; Ch. 3 of FR; Ch. 18, 19, 20 of LS; Shipra Agrawal’s talk | |
1/29 | General contextual bandits: UCB for logistic bandits, RegCB, SquareCB | Slides; Ch. 3 of FR; Dylan Foster’s talk | |
1/31 | | | Last day to enroll |
2/5 | Adversarial online learning: exponential weight algorithm, projected gradient descent | Slides; Ch. 28 of LS; 5.5-5.11 of Constantine Caramanis’s channel | |
2/7 | Adversarial multi-armed bandits: Exp3 | Slides; Haipeng Luo’s talk | |
2/12 | Adversarial linear bandits: one-point gradient estimator + projected gradient descent, doubly robust estimator | Slides; Ch. 5, 6 of L | |
2/14 | | | Project proposal due on 2/16 |
2/19 | Basics of Markov decision processes: Bellman (optimality) equations, reverse Bellman equations, value iteration, (modified) policy iteration, performance difference lemma | Slides; Ch. 1.1-1.3 of AJKS; Ch. 3 of SB | |
2/21 | | | HW1 due on 2/23 |
2/26 | | | |
2/28 | | | |
3/4 | Spring recess | | |
3/6 | Spring recess | | |
3/11 | Approximate value iteration: least-square value iteration (LSVI), Watkins’s Q-learning, deep Q-learning, prioritized replay, double Q-learning | Slides; Ch. 3, 7 of AJKS; Lec. 7, 8 of Sergey Levine’s course | |
3/13 | | | HW2 due on 3/17 |
3/18 | Policy evaluation: least-square policy evaluation (LSPE), temporal difference (TD) learning, Monte Carlo estimation, TD(λ) | Slides; Ch. 5.1, 5.2, 5.5, 6.1-6.3, 9.1-9.4, 11.1-11.3, 12.1-12.5 of SB | |
3/20 | | | |
3/25 | Policy-based learning methods: least-square policy iteration (LSPI), policy gradient, natural policy gradient (NPG) | Slides; Notes of J on 3/24-3/31; Lec. 5, 6, 9 of Sergey Levine’s course; W. van Heeswijk’s paper | |
3/27 | | | Project milestone due on 3/29 |
4/1 | | | |
4/3 | Actor-critic methods: advantage actor-critic (A2C), proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin-delayed DDPG (TD3), soft actor-critic (SAC) | Slides; Algorithms Docs in Spinning Up; References in the slides | HW3 due on 4/7 |
4/8 | | | |
4/10 | | | |
4/15 | Student presentation | | |
4/17 | Student presentation | | |
4/22 | Student presentation | | |
4/24 | Student presentation | | |
4/29 | Summary | Slides | HW4 due on 5/10 |
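The tabular algorithms covered early in the MDP unit are short to implement. As an illustration, here is a minimal sketch of value iteration (a 2/19 topic), which repeatedly applies the Bellman optimality backup until the value function converges. The two-state MDP and all names below are made up for illustration, not taken from the course materials:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a][s'] = transition prob, R[s][a] = reward; returns (V*, greedy policy)."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * E[V(s')] ]
        new_V = [
            max(
                R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        converged = max(abs(new_V[s] - V[s]) for s in range(n_states)) < tol
        V = new_V
        if converged:
            break
    # Greedy policy with respect to the (approximately) optimal V
    policy = [
        max(
            range(n_actions),
            key=lambda a: R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states)),
        )
        for s in range(n_states)
    ]
    return V, policy

# Toy 2-state, 2-action MDP: in state 0, action 1 moves to the rewarding state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 goes to state 1
     [[0.0, 1.0], [0.0, 1.0]]]   # state 1: both actions stay in state 1
R = [[0.0, 0.0],                 # state 0: no immediate reward
     [1.0, 1.0]]                 # state 1: reward 1 every step
V, policy = value_iteration(P, R)
```

With gamma = 0.9, the fixed point is V(1) = 1/(1 - 0.9) = 10 and V(0) = 0.9 · 10 = 9, and the greedy policy moves from state 0 to state 1.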
Books and Lecture Notes
- Bandit Algorithms (LS) by Tor Lattimore and Csaba Szepesvari
- Reinforcement Learning: An Introduction (SB) by Richard Sutton and Andrew Barto
- Reinforcement Learning: Theory and Algorithms (AJKS) by Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun
- Statistical Reinforcement Learning and Decision Making: Course Notes (FR) by Dylan Foster and Sasha Rakhlin
Related Courses at Other Institutions
- Deep Reinforcement Learning by Sergey Levine
- Reinforcement Learning by Emma Brunskill
- RL Lecture Series by Hado van Hasselt
- Introduction to Reinforcement Learning by Lucas Janson and Sham Kakade
- Introduction to Reinforcement Learning and Foundations of Reinforcement Learning by Wen Sun
- Topics in Bandits and Reinforcement Learning Theory by Chicheng Zhang
- Foundations of Reinforcement Learning by Chi Jin
- Reinforcement Learning and Statistical Reinforcement Learning by Nan Jiang
- Theoretical Foundations of Reinforcement Learning by Csaba Szepesvari
- Theory of Reinforcement Learning by Ambuj Tewari
- Theory of Multi-armed Bandits and Reinforcement Learning by Jiantao Jiao
- Statistical Reinforcement Learning and Decision Making by Dylan Foster and Sasha Rakhlin
- Introduction to Online Optimization/Learning by Haipeng Luo
Previous RL Courses at UVA
- Topics in Reinforcement Learning by Shangtong Zhang
- Reinforcement Learning by Hongning Wang