Model-Free Control
Introduction
Use of Model-Free Control
Model-free control can solve problems where:
- MDP model is unknown, but experience can be sampled
- MDP model is known, but is too big to use, except by samples
On/Off-Policy Learning
- On-policy learning
  - "Learn on the job"
  - Learn about policy $\pi$ from experience sampled from $\pi$
- Off-policy learning
  - "Look over someone's shoulder"
  - Learn about policy $\pi$ from experience sampled from $\mu$
On-Policy Monte-Carlo Control
$\epsilon$-Greedy Exploration
- Simplest idea for ensuring continual exploration
- All $m$ actions are tried with non-zero probability
- With probability $1-\epsilon$ choose the greedy action
- With probability $\epsilon$ choose an action at random
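With $m$ actions, the resulting $\epsilon$-greedy policy can be written as

$$ \pi(a \mid s) = \begin{cases} \dfrac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \underset{a' \in \mathcal{A}}{\arg\max}\, Q(s, a') \\ \dfrac{\epsilon}{m} & \text{otherwise} \end{cases} $$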
$\epsilon$-Greedy Policy Improvement
GLIE (Greedy in the Limit with Infinite Exploration)
GLIE Monte-Carlo Control
- Sample $k$th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode, update the count and action-value: $$ N(S_t, A_t) \leftarrow N(S_t, A_t) + 1 $$ $$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right) $$
- Improve policy based on new action-value function: $$ \epsilon \leftarrow \frac{1}{k}, \qquad \pi \leftarrow \epsilon\text{-greedy}(Q) $$
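Putting these pieces together, here is a minimal sketch of GLIE Monte-Carlo control for a small tabular problem. The environment interface (`env.reset()`, `env.step(action)`, `env.actions(state)`) is a hypothetical stand-in, not part of the algorithm itself.

```python
import random
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch (tabular, every-visit).

    Assumes a hypothetical env with reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state).
    """
    Q = defaultdict(float)           # action-value estimates Q(s, a)
    N = defaultdict(int)             # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k            # GLIE schedule: epsilon_k = 1/k

        # Sample the k-th episode using the current eps-greedy policy
        episode = []
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)                     # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])  # greedy
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit MC update towards the return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

    return Q
```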
On-Policy Temporal-Difference Learning
Sarsa Algorithm for On-policy Control
Convergence of Sarsa
Forward View Sarsa($\lambda$)
Backward View Sarsa($\lambda$)
- Like TD($\lambda$), we use eligibility traces in an online algorithm
- Unlike TD($\lambda$), Sarsa($\lambda$) has one eligibility trace for each state-action pair
- $Q(s,a)$ is updated for every state $s$ and action $a$, in proportion to the TD-error and the eligibility trace
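With accumulating traces, the standard form of these updates is

$$ E_0(s,a) = 0, \qquad E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a) $$
$$ \delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) $$
$$ Q(s,a) \leftarrow Q(s,a) + \alpha \, \delta_t \, E_t(s,a) $$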
Off-Policy Learning
- Evaluate target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s,a)$, while following behavior policy $\mu(a|s)$
Why Important?
- Learn from observing humans or other agents
- Re-use experience generated from old policies
- Learn about optimal policy while following exploratory policy
- Learn about multiple policies while following one policy
Importance Sampling
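Importance sampling estimates the expectation of a function under one distribution using samples drawn from another:

$$ \mathbb{E}_{X \sim P}[f(X)] = \sum_X P(X) f(X) = \sum_X Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[ \frac{P(X)}{Q(X)} f(X) \right] $$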
Importance Sampling for Off-policy Monte-Carlo
- Use returns generated from $\mu$ to evaluate $\pi$
- Weight return $G_t$ according to similarity between policies
- Update value towards corrected return
- Cannot use if $\mu$ is zero when $\pi$ is non-zero
- Importance sampling can dramatically increase variance
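Concretely, the importance sampling corrections are multiplied along the whole episode, and the value is updated towards the corrected return:

$$ G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)} G_t $$
$$ V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right) $$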
Importance Sampling for Off-Policy TD
- Use TD targets generated from $\mu$ to evaluate $\pi$
- Much lower variance than Monte-Carlo importance sampling
- Policies only need to be similar over a single step
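Only a single importance sampling correction is needed, applied to the TD target:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right) $$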
Q-Learning
- We now consider off-policy learning of action values $Q(s,a)$
- No importance sampling is required
- Next action is chosen using the behavior policy
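One standard way to write the update: bootstrap from an alternative successor action $A' \sim \pi(\cdot \mid S_{t+1})$ drawn from the target policy, rather than the action actually taken by the behavior policy:

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right) $$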
Off-policy Control with Q-Learning
- Allow both behavior and target policies to improve
- Target policy $\pi$ is greedy w.r.t. $Q(s,a)$
- Behavior policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$
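With a greedy target policy, the Q-learning target simplifies to $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$. Here is a minimal tabular sketch of this control loop, using the same hypothetical environment interface as the Monte-Carlo example above:

```python
import random
from collections import defaultdict

def q_learning_control(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy Q-learning control sketch (tabular).

    Behavior policy: epsilon-greedy w.r.t. Q; target policy: greedy w.r.t. Q.
    Assumes a hypothetical env with reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state).
    """
    Q = defaultdict(float)

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behavior policy mu: epsilon-greedy w.r.t. Q
            actions = env.actions(state)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Target policy pi: greedy w.r.t. Q, so bootstrap from the max
            if done:
                target = reward
            else:
                target = reward + gamma * max(
                    Q[(next_state, a)] for a in env.actions(next_state)
                )

            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state

    return Q
```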