** Next:** Q-learning in Generalized MDPs
** Up:** Preliminaries
** Previous:** Markov Decision Processes with

## Generalized Markov Decision Processes

[Szepesvári and Littman(1996)] have introduced a more general
model. Their basic observation is that in the Bellman equations, the
operator $\bigoplus$ (i.e., taking the expected value
w.r.t. the transition probabilities) describes the effect of the
environment, while the operator $\bigotimes$ describes the effect
of an optimal agent (i.e., selecting an action with maximum
expected value). By changing these operators, other well-known models
can be recovered.

A generalized MDP is defined by the tuple
$\langle X, A, R, \bigoplus, \bigotimes \rangle$, where $X$, $A$, $R$ are defined as above;
$\bigoplus$ is an "expected value-type" operator, mapping functions
over $X \times A \times X$ to functions over $X \times A$,
and $\bigotimes$ is a "maximization-type" operator, mapping functions
over $X \times A$ to functions over $X$. For
example, by setting
$(\bigoplus Q)(x,a) = \sum_{y} P(x,a,y)\, Q(x,a,y)$
and
$(\bigotimes Q)(x) = \max_{a} Q(x,a)$, the
expected-reward MDP model appears.
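As a concrete illustration, here is a minimal NumPy sketch of the expected-reward instantiation of the two operators ($\bigoplus$ = probability-weighted average over successor states, $\bigotimes$ = maximum over actions). The tiny two-state, two-action MDP and the array names `P` and `R` are hypothetical, chosen only to make the operator shapes visible:

```python
import numpy as np

# Hypothetical tiny MDP: P[x, a, y] = transition probability,
# R[x, a, y] = immediate reward for the transition x --a--> y.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum(axis=2, keepdims=True)   # each P[x, a, :] is a distribution
R = rng.random((2, 2, 2))

def oplus(Q):
    """Expected-value-type operator: averages over successor states y."""
    return (P * Q).sum(axis=2)       # result is indexed by (x, a)

def otimes(Q):
    """Maximization-type operator: greedy choice over actions a."""
    return Q.max(axis=1)             # result is indexed by x
```

Note how the types match the text: `oplus` turns a function of $(x,a,y)$ into a function of $(x,a)$, and `otimes` turns a function of $(x,a)$ into a function of $x$.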

The task is to find a value function satisfying the abstract
Bellman equations:

$$ V^*(x) = \bigotimes_{a} \bigoplus_{y} \bigl( R(x,a,y) + \gamma V^*(y) \bigr), $$

or in short form:

$$ V^* = \bigotimes \bigoplus (R + \gamma V^*). $$
The optimal value function can be interpreted as the total reward
received by an agent behaving optimally in a non-deterministic
environment. The operator $\bigoplus$ describes the effect of the
environment, i.e., how the value of taking action $a$ in state $x$
depends on the (non-deterministic) successor state $y$. The
operator $\bigotimes$ describes the action selection of an optimal
agent. When $0 \le \gamma < 1$, and both $\bigoplus$
and $\bigotimes$ are non-expansions, the optimal solution of the abstract
Bellman equations exists and is unique.
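The existence and uniqueness claim can be checked numerically: since the composed operator is a $\gamma$-contraction, iterating the Bellman update from any starting point converges to the same fixed point. A minimal sketch, assuming the expected-reward operators and a hypothetical three-state MDP:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with discount gamma < 1.
rng = np.random.default_rng(1)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2, 3))
gamma = 0.9

def bellman(V):
    # V(x) <- max_a sum_y P(x,a,y) * (R(x,a,y) + gamma * V(y))
    return (P * (R + gamma * V)).sum(axis=2).max(axis=1)

# Two very different initial guesses...
V1, V2 = np.zeros(3), 100.0 * np.ones(3)
for _ in range(500):
    V1, V2 = bellman(V1), bellman(V2)
# ...both iterates end up at the same fixed point V*.
```

The gap between the two iterates shrinks by a factor of at least $\gamma$ per step, which is exactly the contraction argument behind the uniqueness statement.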
The great advantage of the generalized MDP model is that a wide
range of models, e.g., Markov games [Littman(1994)],
alternating Markov games [Boyan(1992)], discounted
expected-reward MDPs [Watkins and Dayan(1992)], risk-sensitive
MDPs [Heger(1994)], and exploration-sensitive MDPs
[John(1994)], can be treated in this unified framework. For
details, see the work of [Szepesvári and Littman(1996)].
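To see how swapping an operator changes the model, the sketch below contrasts the expected-value $\bigoplus$ with a worst-case variant in the spirit of the risk-sensitive MDPs of [Heger(1994)] (the two-state MDP is hypothetical, and the worst-case operator shown is one plausible instantiation, not the paper's exact definition):

```python
import numpy as np

# Hypothetical tiny MDP for comparing two choices of the oplus operator.
rng = np.random.default_rng(2)
P = rng.random((2, 2, 2))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((2, 2, 2))

def oplus_expected(Q):
    """Expected-reward MDP: average over successor states y."""
    return (P * Q).sum(axis=2)

def oplus_worst_case(Q):
    """Risk-sensitive flavour: worst reachable successor state y."""
    return np.where(P > 0, Q, np.inf).min(axis=2)

def otimes_max(Q):
    """Optimal agent: greedy choice over actions a."""
    return Q.max(axis=1)

V_exp = otimes_max(oplus_expected(R))    # one-step expected values
V_wc  = otimes_max(oplus_worst_case(R))  # one-step worst-case values
```

Only $\bigoplus$ changed, yet the fixed-point problem now describes a pessimistic agent; the worst-case values can never exceed the expected ones.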

