1.Introduction

What is dynamic programming?

  • Dynamic: a sequential or temporal component to the problem

  • Programming: optimizing a "program", i.e. a policy (start with a mapping from states to actions and optimize that behavior)

    • c.f. linear programming

Put the two ideas together ;)

  • A method for solving complex problems

  • By breaking them down into subproblems

    • Solve the subproblems

    • Combine the solutions to the subproblems

Requirements for Dynamic Programming

Dynamic programming is a very general solution method for problems which have two properties:

  • Optimal substructure
    • Principle of optimality applies

    • Optimal solution can be decomposed into subproblems

  • Overlapping subproblems
    • Subproblems recur many times
  • Markov decision processes satisfy both properties
    • Bellman equation gives recursive decomposition

    • Value function stores and reuses solutions

Planning by Dynamic Programming

  • Dynamic programming assumes full knowledge of the MDP

  • It is used for planning in an MDP

  • For prediction:
    • Input: MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \rangle$ and policy $\pi$

    • or: MRP $\langle \mathcal{S},\mathcal{P}^{\pi},\mathcal{R}^{\pi}, \gamma \rangle$

    • Output: value function $v_\pi$

  • Or for control:
    • Input: MDP $\langle \mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma \rangle$

    • Instead of being given a policy, we want to find the best policy for the MDP: the best mapping from states to actions that achieves the most long-term reward

    • Output: optimal value function $v_*$

    • and equivalently: optimal policy $\pi_*$

Not only used in MDPs

Dynamic programming is used to solve many other problems that have optimal substructure and overlapping subproblems, e.g.

  • Scheduling algorithms

  • String algorithms (e.g. sequence alignment)

  • Graph algorithms (e.g. shortest path algorithms)

  • Graphical models (e.g. Viterbi algorithm)

  • Bioinformatics (e.g. lattice models)

2.Policy Evaluation

Iterative Policy Evaluation

  • Problem: evaluate a given policy $\pi$

  • Solution: iterative application of the Bellman expectation backup. The Bellman expectation equation is used for policy evaluation; the Bellman optimality equation will be used for control. Turning the Bellman equation into an iterative update gives our first mechanism for evaluating policies.

  • $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_\pi$: each step is a one-step lookahead using the Bellman equation (e.g. $v_2$ is computed from $v_1$)

  • Using synchronous backups/evaluations,
    • At each iteration k + 1,

    • For all states $s \in S$

    • Update $v_{k+1}(s)$ from $v_k(s')$

    • where $s'$ is a successor state of $s$

  • We will discuss asynchronous backups later

  • Convergence to $v_\pi$ will be proven at the end of the lecture

Iterative Policy Evaluation - continued

\(v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right)\)

  • $\pi(a|s)$ represents the probability of taking action $a$ in state $s$ under policy $\pi$
  • $\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s')$ represents the expected long-term reward of taking action $a$: the immediate reward plus the discounted value of the successor states

More concisely,

\[v_{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_k\]
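
As a concrete illustration, here is a minimal sketch of iterative policy evaluation in Python. The tabular `P`/`R` arrays, the stopping tolerance `theta`, and the function name are assumptions for the sketch, not from the lecture:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Evaluate policy pi by repeated Bellman expectation backups.

    P[a] is an (n_states x n_states) transition matrix for action a,
    R[a] is a length-n_states reward vector, pi is (n_states x n_actions).
    """
    n_states, n_actions = pi.shape
    v = np.zeros(n_states)
    while True:
        # Synchronous backup: compute v_{k+1} for all states from v_k.
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])  # (n_actions, n_states)
        v_new = np.sum(pi.T * q, axis=0)   # weight each action's value by pi(a|s)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```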

3.Policy Iteration

How to Improve a Policy

  • Given a policy $\pi$,
    • Evaluate the policy $\pi$,

    \(v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s]\)

    • Improve the policy by acting greedily with respect to $v_\pi$: \(\pi' = \text{greedy}(v_\pi)\)
  • In the small gridworld the improved policy was optimal, $\pi' = \pi^*$

  • In general, need more iterations of improvement/evaluation

  • But this process of policy iteration always converges to $\pi^*$

Policy Improvement

  • Consider a deterministic policy $a = \pi(s)$

  • We can improve the policy by acting greedily

\(\pi'(s) = \argmax_{a \in \mathcal{A}} q_\pi(s,a)\)

  • This improves the value from any state $s$ over one step,
    \(q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s,a) \geq q_\pi(s, \pi(s)) = v_\pi(s)\)

  • It therefore improves the value function, $v_{\pi'}(s) \geq v_\pi(s)$

  • If improvements stop, i.e. $\max_{a \in \mathcal{A}} q_\pi(s,a) = q_\pi(s, \pi(s)) = v_\pi(s)$,

  • Then the Bellman optimality equation has been satisfied

  • Therefore $v_\pi(s) = v_*(s)$ for all $s \in \mathcal{S}$, and $\pi$ is an optimal policy.
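
A minimal sketch of the greedy improvement step, under the same assumed tabular representation as the evaluation sketch above:

```python
import numpy as np

def greedy_improvement(P, R, v, gamma=0.9):
    """Return a deterministic policy pi'(s) = argmax_a q_pi(s, a)."""
    n_actions = len(P)
    # q[a, s] = R_s^a + gamma * sum_s' P_ss'^a v(s')
    q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])
    return np.argmax(q, axis=0)  # best action index for each state
```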

Modified Policy Iteration

  • Does policy evaluation need to converge to $v_\pi$?
    • or should we introduce a stopping condition,

    • or simply stop after k iterations of iterative policy evaluation?

  • E.g., in the small gridworld $k = 3$ was sufficient to achieve the optimal policy

  • Why not update the policy every iteration, i.e. stop after $k = 1$?
    • This is equivalent to value iteration

Generalized Policy Iteration

  • Policy evaluation
    • Estimate $v_\pi$ using any policy evaluation algorithm
  • Policy improvement
    • Generate $\pi' \geq \pi$ using any policy improvement algorithm (see the sketch below)
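
Combining the two hypothetical functions sketched above gives a sketch of the full policy iteration loop:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation and greedy improvement until stable."""
    n_actions, n_states = len(P), len(R[0])
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma)   # evaluate current policy
        greedy = greedy_improvement(P, R, v, gamma)        # act greedily w.r.t. v_pi
        pi_new = np.eye(n_actions)[greedy]                 # deterministic policy as one-hot rows
        if np.array_equal(pi_new, pi):                     # greedy policy unchanged => optimal
            return pi_new, v
        pi = pi_new
```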

4.Value Iteration

Principle of Optimality

Any optimal policy can be subdivided into two components:

  • An optimal first action $A_*$

  • Followed by an optimal policy from the successor state $S'$

Deterministic Value Iteration

  • If we know the solution to the subproblems $v_*(s')$, then the solution $v_*(s)$ can be found by a one-step lookahead \(v_{*}(s) \leftarrow \max_{a \in \mathcal{A}} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right)\)

  • The idea of value iteration is to apply these updates iteratively

  • Intuition: start with final rewards and work backwards

  • Still works with loopy, stochastic MDPs
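
A minimal sketch of value iteration under the same assumed tabular representation (the stopping tolerance is again an assumption):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality backup until the value function converges."""
    n_actions, n_states = len(P), len(R[0])
    v = np.zeros(n_states)
    while True:
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])
        v_new = q.max(axis=0)               # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new, q.argmax(axis=0)  # optimal values and a greedy policy
        v = v_new
```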

5.Extensions to Dynamic Programming

Asynchronous Dynamic Programming

  • DP methods described so far used synchronous backups

  • i.e. all states are backed up in parallel

  • Asynchronous DP backs up states individually, in any order

  • For each selected state, apply the appropriate backup

  • Can significantly reduce computation

  • Guaranteed to converge if all states continue to be selected

  • Three ideas for asynchronous dynamic programming:

    • In-place dynamic programming
    • Prioritized sweeping
    • Real-time dynamic programming

In-Place Dynamic Programming

  • Synchronous value iteration stores two copies of the value function, for all $s \in \mathcal{S}$

  • In-place value iteration only stores one copy of the value function, for all $s \in \mathcal{S}$
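
A sketch of the in-place variant: a single value array is overwritten state by state, so later updates within a sweep already see the new values (the fixed sweep count is an arbitrary assumption):

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, n_sweeps=100):
    """Sweep the states, overwriting v(s) immediately rather than
    keeping a separate copy of v_k and v_{k+1}."""
    n_actions, n_states = len(P), len(R[0])
    v = np.zeros(n_states)                 # only one copy of the value function
    for _ in range(n_sweeps):
        for s in range(n_states):
            v[s] = max(R[a][s] + gamma * P[a][s] @ v for a in range(n_actions))
    return v
```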

Prioritized Sweeping

  • Use magnitude of Bellman error to guide state selection, e.g.

  • Backup the state with the largest remaining Bellman error

  • Update Bellman error of affected states after each backup

  • Requires knowledge of reverse dynamics (predecessor states)

  • Can be implemented efficiently by maintaining a priority queue
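
A rough sketch of prioritized sweeping, assuming a precomputed `predecessors` map from each state to the states that can transition into it; priorities are Bellman-error magnitudes kept in a lazy max-heap (stale heap entries just trigger harmless extra backups):

```python
import heapq
import numpy as np

def bellman_error(P, R, v, s, gamma):
    """|max_a (R_s^a + gamma * sum_s' P_ss'^a v(s')) - v(s)|"""
    best = max(R[a][s] + gamma * P[a][s] @ v for a in range(len(P)))
    return abs(best - v[s])

def prioritized_sweeping(P, R, predecessors, gamma=0.9, n_backups=10000, theta=1e-8):
    n_states = len(R[0])
    v = np.zeros(n_states)
    # Max-heap via negated priorities, keyed on the current Bellman error.
    heap = [(-bellman_error(P, R, v, s, gamma), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        err, s = heapq.heappop(heap)       # state with the largest recorded error
        if -err < theta:
            break
        v[s] = max(R[a][s] + gamma * P[a][s] @ v for a in range(len(P)))
        for p in predecessors[s]:          # only predecessors' errors can change
            heapq.heappush(heap, (-bellman_error(P, R, v, p, gamma), p))
    return v
```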

Real-Time Dynamic Programming

  • Idea: back up only the states that are relevant to the agent

  • Use agent’s experience to guide the selection of states

  • After each time-step $S_t, A_t, R_{t+1}$

  • Backup the state $S_t$
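
A sketch of the real-time flavour; the `env` interface (`reset`/`step` returning next state, reward, done) and the greedy behaviour policy are assumptions for illustration:

```python
import numpy as np

def real_time_dp(env, P, R, gamma=0.9, n_steps=10000):
    """Back up only the states the agent actually visits."""
    n_actions, n_states = len(P), len(R[0])
    v = np.zeros(n_states)
    s = env.reset()
    for _ in range(n_steps):
        # Full-width Bellman optimality backup, but only at the visited state S_t.
        q = [R[a][s] + gamma * P[a][s] @ v for a in range(n_actions)]
        v[s] = max(q)
        a = int(np.argmax(q))              # act greedily w.r.t. the current values
        s, _, done = env.step(a)
        if done:
            s = env.reset()
    return v
```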

Full-Width Backups

  • DP uses full-width backups

  • For each backup (sync or async)
    • Every successor state and action is considered

    • Using knowledge of the MDP transitions and reward function

  • DP is effective for medium-sized problems (millions of states)

  • For large problems, DP suffers Bellman’s curse of dimensionality:
    • Number of states grows exponentially with number of state variables
  • Even one backup can be too expensive

Sample Backups

  • Using sample rewards and sample transitions $\langle S, A, R, S' \rangle$ instead of the reward function $\mathcal{R}$ and transition dynamics $\mathcal{P}$

  • Advantages:

    • Model-free: no advance knowledge of MDP required

    • Breaks the curse of dimensionality through sampling

    • Cost of backup is constant, independent of $n = |\mathcal{S}|$
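
For contrast with the full-width backups above, a sample backup touches only one sampled successor. A minimal sketch, where the step size `alpha` and the TD-style form of the update are assumptions, not from this lecture:

```python
def sample_backup(v, s, r, s_next, gamma=0.9, alpha=0.1):
    """Update v(s) from a single sampled transition (s, a, r, s_next)
    rather than summing over all successors using the model P, R."""
    v[s] += alpha * (r + gamma * v[s_next] - v[s])
    return v
```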

Approximate Dynamic Programming

  • Approximate the value function

  • Using a function approximator $\hat{v}(s, \mathbf{w})$

  • Apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$

Some Technical Questions

  • How do we know that value iteration converges to $v_*$?

  • Or that iterative policy evaluation converges to $v_\pi$?

  • And therefore that policy iteration converges to $v_*$?

  • Is the solution unique?

  • How fast do these algorithms converge?

  • These questions are resolved by the contraction mapping theorem.

Value Function Space

  • Consider the vector space $\mathcal{V}$ over value functions

  • There are $|\mathcal{S}|$ dimensions (one per state)

  • Each point in this space fully specifies a value function

  • What does a Bellman backup do to points in this space?

    • It will bring value functions closer

    • Therefore the backups must converge on a unique solution

Value Function $\infty$-Norm

  • We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm

  • i.e. the largest difference between state values
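
Concretely, the distance used is

\[\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|\]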

6.Contraction Mapping

Bellman Expectation Backup is a Contraction

  • Define the Bellman expectation backup operator $T^\pi$ (written out below)

  • This operator is a $\gamma$-contraction, meaning it makes value functions closer by at least $\gamma$
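
In matrix form, consistent with the policy evaluation update above, the operator and its contraction property read

\[T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v, \qquad \lVert T^\pi(u) - T^\pi(v) \rVert_\infty \leq \gamma\, \lVert u - v \rVert_\infty\]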

Contraction Mapping Theorem
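
In outline: for any metric space $\mathcal{V}$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,

  • $T$ converges to a unique fixed point

  • at a linear convergence rate of $\gamma$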

Convergence of Iterative Policy Evaluation/Iteration

  • The Bellman expectation operator $T^\pi$ has a unique fixed point

  • $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)

  • By the contraction mapping theorem:

    • Iterative policy evaluation converges on $v_\pi$

    • Policy iteration converges on $v_*$

Bellman Optimality Backup is a Contraction

  • Define the Bellman optimality backup operator $T^*$ (written out below)

  • This operator is a $\gamma$-contraction, meaning it makes value functions closer by at least $\gamma$
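
Written out per state, and consistent with the value iteration update above,

\[T^*(v)(s) = \max_{a \in \mathcal{A}} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s')\right), \qquad \lVert T^*(u) - T^*(v) \rVert_\infty \leq \gamma\, \lVert u - v \rVert_\infty\]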

Convergence of Value Iteration

  • The Bellman optimality operator $T^*$ has a unique fixed point

  • By the Bellman optimality equation, $v_*$ is a fixed point of $T^*$

  • By contraction mapping theorem

  • Value iteration converges on $v_*$