** Next:** Q-learning in Generalized MDPs
** Up:** Preliminaries
** Previous:** Markov Decision Processes with

## Generalized Markov Decision Processes

[Szepesvári and Littman(1996)] have introduced a more general
model. Their basic observation is that in the Bellman equations, the
operator $\bigoplus$ (i.e., taking the expected value
w.r.t. the transition probabilities) describes the effect of the
environment, while the operator $\bigotimes$ describes the effect
of an optimal agent (i.e., selecting an action with maximum
expected value). By changing these operators, other well-known models
can be recovered.

A generalized MDP is defined by the tuple
$\langle X, A, R, \bigoplus, \bigotimes \rangle$, where $X$, $A$, $R$ are defined as above;
$\bigoplus$ is an "expected value-type" operator, mapping functions
over $X \times A \times X$ to functions over $X \times A$,
and $\bigotimes$ is a "maximization-type" operator, mapping functions
over $X \times A$ to functions over $X$. For
example, by setting
$(\bigoplus Q)(x,a) = \sum_{y} P(x,a,y)\, Q(x,a,y)$
and
$(\bigotimes Q)(x) = \max_{a} Q(x,a)$, the
expected-reward MDP model appears.
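As a concrete illustration, here is a minimal NumPy sketch of the expected-reward instantiation of the two operators ($\bigoplus$ = probability-weighted average over successor states, $\bigotimes$ = maximum over actions). The tiny two-state, two-action MDP and the array names `P` and `R` are hypothetical, chosen only to make the operator shapes visible:

```python
import numpy as np

# Hypothetical tiny MDP: P[x, a, y] = transition probability,
# R[x, a, y] = immediate reward for the transition x --a--> y.
rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum(axis=2, keepdims=True)   # each P[x, a, :] is a distribution
R = rng.random((2, 2, 2))

def oplus(Q):
    """Expected-value-type operator: averages over successor states y."""
    return (P * Q).sum(axis=2)       # result is indexed by (x, a)

def otimes(Q):
    """Maximization-type operator: greedy choice over actions a."""
    return Q.max(axis=1)             # result is indexed by x
```

Note how the types match the text: `oplus` turns a function of $(x,a,y)$ into a function of $(x,a)$, and `otimes` turns a function of $(x,a)$ into a function of $x$.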

The task is to find a value function satisfying the abstract
Bellman equations:

$$ V^*(x) = \bigotimes_{a} \bigoplus_{y} \bigl( R(x,a,y) + \gamma V^*(y) \bigr), $$

or in short form:

$$ V^* = \bigotimes \bigoplus (R + \gamma V^*). $$
The optimal value function can be interpreted as the total reward
received by an agent behaving optimally in a non-deterministic
environment. The operator $\bigoplus$ describes the effect of the
environment, i.e., how the value of taking action $a$ in state $x$
depends on the (non-deterministic) successor state $y$. The
operator $\bigotimes$ describes the action selection of an optimal
agent. When $0 \le \gamma < 1$, and both $\bigoplus$
and $\bigotimes$ are non-expansions, the optimal solution of the abstract
Bellman equations exists and is unique.
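The existence and uniqueness claim can be checked numerically: since the composed operator is a $\gamma$-contraction, iterating the Bellman update from any starting point converges to the same fixed point. A minimal sketch, assuming the expected-reward operators and a hypothetical three-state MDP:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with discount gamma < 1.
rng = np.random.default_rng(1)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((3, 2, 3))
gamma = 0.9

def bellman(V):
    # V(x) <- max_a sum_y P(x,a,y) * (R(x,a,y) + gamma * V(y))
    return (P * (R + gamma * V)).sum(axis=2).max(axis=1)

# Two very different initial guesses...
V1, V2 = np.zeros(3), 100.0 * np.ones(3)
for _ in range(500):
    V1, V2 = bellman(V1), bellman(V2)
# ...both iterates end up at the same fixed point V*.
```

The gap between the two iterates shrinks by a factor of at least $\gamma$ per step, which is exactly the contraction argument behind the uniqueness statement.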
The great advantage of the generalized MDP model is that a wide
range of models, e.g., Markov games [Littman(1994)],
alternating Markov games [Boyan(1992)], discounted
expected-reward MDPs [Watkins and Dayan(1992)], risk-sensitive
MDPs [Heger(1994)], and exploration-sensitive MDPs
[John(1994)], can be treated in this unified framework. For
details, see the work of [Szepesvári and Littman(1996)].
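To see how swapping an operator changes the model, the sketch below contrasts the expected-value $\bigoplus$ with a worst-case variant in the spirit of the risk-sensitive MDPs of [Heger(1994)] (the two-state MDP is hypothetical, and the worst-case operator shown is one plausible instantiation, not the paper's exact definition):

```python
import numpy as np

# Hypothetical tiny MDP for comparing two choices of the oplus operator.
rng = np.random.default_rng(2)
P = rng.random((2, 2, 2))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((2, 2, 2))

def oplus_expected(Q):
    """Expected-reward MDP: average over successor states y."""
    return (P * Q).sum(axis=2)

def oplus_worst_case(Q):
    """Risk-sensitive flavour: worst reachable successor state y."""
    return np.where(P > 0, Q, np.inf).min(axis=2)

def otimes_max(Q):
    """Optimal agent: greedy choice over actions a."""
    return Q.max(axis=1)

V_exp = otimes_max(oplus_expected(R))    # one-step expected values
V_wc  = otimes_max(oplus_worst_case(R))  # one-step worst-case values
```

Only $\bigoplus$ changed, yet the fixed-point problem now describes a pessimistic agent; the worst-case values can never exceed the expected ones.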

