Reinforcement Learning (Spring 2025)

Course Information

Overview

Reinforcement learning (RL) is a powerful learning paradigm through which machines learn to make (sequential) decisions. It has played a pivotal role in advancing artificial intelligence, with notable successes including mastering the game of Go and enhancing large language models.

This course focuses on the design principles of RL algorithms. As in statistical learning, a central challenge in RL is to generalize learned capabilities to unseen environments. However, RL also faces additional challenges such as the exploration-exploitation tradeoff, credit assignment, and distribution mismatch between behavior and target policies. Throughout the course, we will examine various solutions to these challenges and provide theoretical justifications for them.

Prerequisites

This course is mathematically demanding. Students are expected to have strong foundations in probability, linear algebra, and calculus. A basic understanding of machine learning and convex optimization will be beneficial. Proficiency in Python programming is required.

Topics

Bandits, online learning, dynamic programming, Q-learning, policy evaluation, policy gradient.

Platforms

Grading

Late policy for assignments: 10 free late days can be used across all assignments. Each additional late day will result in a 10% deduction in the semester’s assignment grade. No assignment can be submitted more than 7 days after its deadline.

Schedule

One slide deck may be used for multiple lectures.

Date | Topics | Materials | Assignments
1/13 | Introduction | Slides, Recording | HW0 (no submission needed)
1/15 | Value-based bandits: Explore-then-exploit, ε-greedy | Slides, Recording |
1/20 | MLK Holiday | |
1/22 | Boltzmann exploration, Inverse gap weighting, Reduction | Recording, Supp-IGW |
1/27 | UCB, TS | Recording | HW1 out
1/29 | Policy-based bandits: Exponential weights (full-information) | Slides, Recording |
2/3 | EXP3 | Recording |
2/5 | PPO | Recording | HW1 due on 2/7
2/10 | NPG, PG | Recording |
2/12 | Bandits with continuous actions: Gradient ascent | Slides, Recording | HW2 out
2/17 | One-point gradient estimators | Recording |
2/19 | PG, PPO | Recording |
2/24 | Markov decision process | Slides, Recording |
2/26 | Dynamic programming | Recording | HW2 due on 2/28
3/3 | Dynamic programming | Recording |
3/5 | | |
3/10 | Spring recess | |
3/12 | Spring recess | |
3/17 | | |
3/19 | | |
3/24 | | |
3/26 | | |
3/31 | | |
4/2 | | |
4/7 | | |
4/9 | | |
4/14 | | |
4/16 | | |
4/21 | | |
4/23 | | |
4/28 | | |

Resources

Previous Offerings