Markov Processes

Introduction to MDPs

  • Markov decision processes formally describe an environment for reinforcement learning
  • Where the environment is fully observable
  • i.e. The current state completely characterizes the process
  • Almost all RL problems can be formalized as MDPs
    • Optimal control primarily deals with continuous MDPs
    • Partially observable problems can be converted into MDPs
    • Bandits are MDPs with one state

Markov Property

“The future is independent of the past given the present”

A state is Markov if and only if $$P[S_{t+1} | S_t] = P[S_{t+1} | S_1, \dots, S_t]$$
  • The state captures all relevant information from the history.

  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future.

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is defined by:

\[\mathcal{P}_{ss'} = P[S_{t+1}=s' | S_t=s]\]

The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$,

\(\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}\) where each row of the matrix sums to 1.
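As a concrete illustration (a hypothetical 3-state chain, not from the lecture), a transition matrix is just a row-stochastic array:

```python
import numpy as np

# Hypothetical 3-state Markov chain: row P[s, :] is the distribution over
# successor states from state s, so every row must sum to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.2, 0.8],
])

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution
```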

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1$, $S_2$, $\cdots$ with the Markov property.

A Markov Process (or Markov chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $$\mathcal{P}_{ss'} = P[S_{t+1}=s' | S_t=s]$$
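A minimal sketch of sampling such a process, assuming a row-stochastic matrix like the hypothetical `P` above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(P, s0, n_steps):
    """Sample a state sequence S_1, S_2, ...: by the Markov property the next
    state is drawn from row P[current state, :] only, regardless of history."""
    states = [s0]
    for _ in range(n_steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

# e.g. sample_chain(P, s0=0, n_steps=10) with the P defined earlier
```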

Markov Reward Processes

Markov Reward Process

A Markov reward process is a Markov chain with values.

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $$\mathcal{P}_{ss'} = P[S_{t+1}=s' | S_t=s]$$
  • $\mathcal{R}$ is a reward function, $$\mathcal{R}_s = E[R_{t+1} | S_t=s]$$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Return

The return $G_t$ is the total discounted reward from time-step $t$, $$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
  • The discount $\gamma$ is the present value of future rewards
  • The value of receiving reward $R$ after $k+1$ time-steps is $\gamma^k R$
  • This values immediate reward above delayed reward
    • $\gamma$ close to 0 leads to “myopic” evaluation
    • $\gamma$ close to 1 leads to “far-sighted” evaluation
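A direct translation of the return definition above (the reward list and $\gamma$ are made up for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    `rewards` holds R_{t+1}, R_{t+2}, ... for one episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# e.g. discounted_return([1.0, 0.0, 2.0], gamma=0.5) == 1.0 + 0.0 + 0.25*2.0 == 1.5
```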

Why discount?

Most Markov reward and decision processes are discounted.

  • Mathematically convenient to discount rewards
  • Avoids infinite returns in cyclic Markov processes
  • Uncertainty about the future may not be fully represented
  • If the reward is financial, immediate rewards may earn more interest than delayed rewards
  • Animal/human behavior shows preference for immediate reward
  • It is sometimes possible to use undiscounted Markov reward processes (i.e. $\gamma=1$), e.g. if all sequences terminate

Value Function

  • The value function $v(s)$ gives the long-term value of state $s$
The state-value function $v(s)$ of an MRP is the expected return starting from state $s$, $$ v(s) = E[G_t | S_t = s] $$

Bellman Equation for MRPs

The value function can be decomposed into two parts:

  • immediate reward $R_{t+1}$
  • discounted value of successor state $\gamma v(S_{t+1})$
\[v(s) = E[R_{t+1} + \gamma v(S_{t+1}) | S_t=s]\]

(backup diagram)

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

\(v = \mathcal{R} + \gamma \mathcal{P}v\) where $v$ is a column vector with one entry per state

\[\begin{bmatrix} v_{1} \\ \vdots \\ v_{n} \end{bmatrix} = \begin{bmatrix} \mathcal{R}_{1} \\ \vdots \\ \mathcal{R}_{n} \end{bmatrix} + \gamma\begin{bmatrix} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{bmatrix}\begin{bmatrix} v_{1} \\ \vdots \\ v_{n} \end{bmatrix}\]

$\mathcal{R}_n$ is the expected reward collected on leaving state $n$.

Solving the Bellman Equation

  • The Bellman equation is a linear equation
  • It can be solved directly (see the NumPy sketch after this list):

$$
\begin{aligned}
v &= \mathcal{R} + \gamma \mathcal{P} v \\
(I - \gamma \mathcal{P})v &= \mathcal{R} \\
v &= (I - \gamma \mathcal{P})^{-1}\mathcal{R}
\end{aligned}
$$

  • Computational complexity (matrix inversion) $O(n^3)$ for n states
  • Direct solution only possible for small MRPs
  • There are many iterative methods for large MRPs
    • Dynamic programming
    • Monte-Carlo evaluation
    • Temporal-Difference learning
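A minimal NumPy check of the direct solution, on a hypothetical 3-state MRP (the numbers in `P` and `R` are invented for illustration):

```python
import numpy as np

# Hypothetical MRP: row-stochastic transition matrix P and expected rewards R.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.2, 0.8],
])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Direct solution: v = (I - gamma * P)^{-1} R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Sanity check: v satisfies the Bellman equation v = R + gamma * P v
assert np.allclose(v, R + gamma * P @ v)
```

For large state spaces the $O(n^3)$ solve above is exactly what the iterative methods listed here avoid.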

Markov Decision Processes

Markov Decision Process

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix, $$\mathcal{P}^a_{ss'} = P[S_{t+1}=s' | S_t=s, A_t=a]$$
  • $\mathcal{R}$ is a reward function, $$\mathcal{R}^a_s = E[R_{t+1} | S_t=s, A_t=a]$$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Policies

A policy $\pi$ is a distribution over actions given states, $$\pi(a|s) = P[A_{t}=a | S_t=s]$$
  • A policy fully defines the behavior of an agent
  • MDP policies depend on the current state (not the history)
  • i.e., Policies are stationary (time-independent), $A_t \sim \pi(\cdot | S_t)$
  • Given an MDP $\mathcal{M}=\langle \mathcal{S},\mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:
  • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^{\pi} \rangle$
  • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma \rangle$, where $\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{P}^a_{ss'}$ and $\mathcal{R}^{\pi}_s = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{R}^a_s$
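A sketch of how the induced MRP could be computed in the tabular case, assuming hypothetical arrays `P[a, s, s_next]`, `R[a, s]` and a stochastic policy `pi[s, a]` (names and shapes are my own convention, not from the lecture):

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Average the MDP dynamics over the policy:
      P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
      R_pi[s]     = sum_a pi(a|s) * R[a, s]
    """
    P_pi = np.einsum('sa,ast->st', pi, P)
    R_pi = np.einsum('sa,as->s', pi, R)
    return P_pi, R_pi
```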

Value Function

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$, $$v_\pi(s) = E_\pi[G_t | S_t=s]$$
The action-value function $q_\pi(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$, $$q_\pi(s,a) = E_\pi[G_t | S_t=s, A_t = a]$$

E.g. an action could be: go left or go right?

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state.

\[v_\pi(s) = E_\pi [R_{t+1} +\gamma v_\pi(S_{t+1}) | S_t = s]\]

The action-value function can similarly be decomposed,

\[q_\pi(s,a) = E_\pi [R_{t+1} +\gamma q_\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]\]

Bellman Expectation Equation for $v_\pi$

(backup diagram)
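Written out, this backup averages the action-values under the policy:

\[v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s,a)\]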

Bellman Expectation Equation for $q_\pi$

(backup diagram)
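Written out, using $\mathcal{R}^a_s$ and $\mathcal{P}^a_{ss'}$ from the MDP definition above:

\[q_\pi(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')\]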

Bellman Expectation Equation for $v_\pi$ - revisit

(backup diagram)
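Substituting the equation for $q_\pi$ into the one for $v_\pi$ gives a one-step lookahead entirely in terms of $v_\pi$:

\[v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\left(\mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_\pi(s')\right)\]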

How to find the best decision?

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely using the induced MRP, \(v_\pi = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi}v_\pi\), with direct solution \(v_\pi = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}\)

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies, $$v_*(s) = \max_\pi v_\pi (s)$$ The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies, $$q_*(s,a) = \max_\pi q_\pi (s,a)$$
  • The optimal value function specifies the best possible performance in the MDP.
  • An MDP is “solved” when we know the optimal value function, i.e. once we have found $q_*$.

Optimal Policy

Define a partial ordering over policies: \(\pi \geq \pi'\) if \(v_\pi (s) \geq v_{\pi'}(s), \forall s\)

For any Markov Decision Process:
  • There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$
  • All optimal policies achieve the optimal value function (there may be more than one optimal policy), $v_{\pi_*}(s) = v_*(s)$
  • All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s,a) = q_*(s,a)$

Finding an Optimal Policy

An optimal policy can be found by maximizing over $q_* (s,a)$

  • There is always a deterministic optimal policy for any MDP
  • If we know $q_* (s,a)$, we immediately have the optimal policy
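Concretely, acting greedily with respect to $q_*$ gives one such deterministic optimal policy:

\[\pi_*(a|s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\ q_*(s,a) \\ 0 & \text{otherwise} \end{cases}\]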

Bellman Optimality Equation for $v_*$

The optimal value functions are recursively related by the Bellman optimality equations: \(v_* (s) = \max_a q_*(s,a)\)

(backup diagram)

Bellman Optimality Equation for $q_*$

(backup diagram)
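Written out:

\[q_*(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s')\]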

Bellman Optimality Equation for $v_*$ - revisit

(backup diagram)
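Combining the two backups gives the Bellman optimality equation for $v_*$:

\[v_*(s) = \max_a \left(\mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_*(s')\right)\]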

Bellman Optimality Equation for $q_*$ - revisit

(backup diagram)
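And likewise for $q_*$:

\[q_*(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, \max_{a'} q_*(s',a')\]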

Solving the Bellman Optimality Equation

  • Bellman Optimality Equation is non-linear
  • No closed form solution (in general)
  • Many iterative solution methods
    • Value Iteration (see the sketch after this list)
    • Policy Iteration
    • Q-learning
    • Sarsa
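A minimal value-iteration sketch for the tabular case, reusing the hypothetical array conventions `P[a, s, s_next]` and `R[a, s]` from the policy section above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup
      v(s) <- max_a [ R[a, s] + gamma * sum_{s'} P[a, s, s'] * v(s') ]
    until the value function stops changing."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v               # q[a, s] for every action/state pair
        v_new = q.max(axis=0)               # greedy backup over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)  # optimal values and a greedy policy
        v = v_new
```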

Approaches to uncertainty:

  • Explicitly represent the uncertainty in the MDP
  • Factor the uncertainty into the MDP model itself
  • Sometimes it is sufficient to adjust the discount factor (uncertainty about the far future is one motivation for discounting)

Bellman’s principle of optimality

Dynamic programming breaks the sequential decision problem into smaller subproblems. Bellman’s principle of optimality describes how to do this:

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)[Wikipedia]

Extensions to MDPs

  • Infinite and continuous MDPs
  • Partially observable MDPs
  • Undiscounted, average reward MDPs