

Markov Decision Processes with the Expected Reward Criterion

The ultimate goal of decision making is to find an optimal behavior subject to some optimality criterion. Optimizing for the infinite-horizon expected discounted total reward is one of the most studied such criteria. Under this criterion, we are trying to find a policy that maximizes the expected value of $ \sum_{t=0}^{\infty} \gamma^t r_t$, where $ r_t$ is the immediate reward in time step $ t$ and $ 0 \le \gamma < 1$ is the discount factor.

A standard way to find an optimal policy is to compute the optimal value function $ V^*: X \to \mathbb{R}$, which gives the value of each state, i.e., the expected cumulative discounted reward when starting from that state. From this, an optimal policy is easily obtained: the `greedy' policy with respect to the optimal value function is optimal (see e.g. [Sutton and Barto(1998)]). This is the well-known value-function approach [Bellman(1957)].

Formally, the optimal value function satisfies the following recursive system of equations known as Bellman equations [Bellman(1957)]:

$\displaystyle V^*(x) = \max_a \sum_y P(x,a,y) \left( R(x,a,y) + \gamma V^*(y) \right), \quad\textrm{for all } x \in X.$ (1)
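
In particular, the greedy policy mentioned above selects in each state an action attaining the maximum on the right-hand side of Equation 1; one way to write it is

$\displaystyle \pi^*(x) \in \arg\max_a \sum_y P(x,a,y) \left( R(x,a,y) + \gamma V^*(y) \right), \quad\textrm{for all } x \in X.$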

Besides the state-value function, other types of value functions can be defined. One example is the state-action value function $ Q^*: X \times A \to \mathbb{R}$, which satisfies

$\displaystyle Q^*(x,a) = \sum_y P(x,a,y) \left( R(x,a,y) + \gamma \max_{a'} Q^*(y,a') \right), \quad\textrm{for all } x \in X, a \in A$    

in the optimal case. $ Q^*(x,a)$ is the expected discounted return of taking action $ a$ in state $ x$ and then following an optimal policy. Here the values of state-action pairs are learned instead of state values, which enables model-free learning [Watkins(1989)]. The corresponding learning algorithm is called Q-learning.
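
As an illustration, a minimal tabular Q-learning sketch in Python is given below; the environment interface (reset(), step()) and the fixed exploration and learning-rate parameters are assumptions made for the example, not prescribed by the text.

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Tabular Q-learning sketch: moves Q(x, a) toward the sampled target r + gamma * max_a' Q(y, a').

    Assumes a hypothetical env with reset() -> state and step(action) -> (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[x]))
            y, r, done = env.step(a)
            # Q-learning update; a terminal state contributes no future value
            target = r + gamma * np.max(Q[y]) * (0.0 if done else 1.0)
            Q[x, a] += alpha * (target - Q[x, a])
            x = y
    return Q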

It is well known that for $ \gamma < 1$, Equation 1 has a unique solution. The Bellman equations can also be written in operator form: $ V^*$ is the (unique) fixed point of the dynamic programming operator $ T$, defined by

$\displaystyle [TV](x) = \max_a \sum_y P(x,a,y) \left( R(x,a,y) + \gamma V(y) \right).$    

Since $ T$ is a contraction in the max-norm, $ V^*$ can be approximated to arbitrary accuracy by iteratively applying $ T$ to an arbitrary initial value function; this procedure is known as value iteration.
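
For a finite MDP, this iteration can be sketched in a few lines of Python; the array layout P[x, a, y] and R[x, a, y] is an assumption made for the example.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Repeatedly applies the dynamic programming operator T until the change is below tol.

    P[x, a, y]: probability of moving from state x to y under action a (assumed layout).
    R[x, a, y]: immediate reward for that transition (assumed layout).
    """
    V = np.zeros(P.shape[0])  # arbitrary initial value function
    while True:
        # [TV](x) = max_a sum_y P(x, a, y) * (R(x, a, y) + gamma * V(y))
        TV = np.max(np.sum(P * (R + gamma * V[None, None, :]), axis=2), axis=1)
        if np.max(np.abs(TV - V)) < tol:
            return TV
        V = TV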

