

MDPs in Varying Environments

In this section we propose an extension of the MDP concept in which the transition probabilities are allowed to vary over time. Without restrictions, however, such a model would be too general to establish useful theorems, so we restrict ourselves to cases in which the variation over time remains small.

We say that the distance of two transition functions $ P$ and $ P'$ is $ \varepsilon $-small ( $ \varepsilon >0$) if $ \Vert P(x,a,.) - P'(x,a,.) \Vert _{L_1} \leq \varepsilon $ for all $ (x,a)$, i.e., $ \sum_y \vert P(x,a,y) - P'(x,a,y) \vert \leq \varepsilon $ for all $ (x,a)$, where the subscript $ L_1$ denotes the $ L_1$ norm. (Note that for a given state $ x$ and action $ a$, $ P(x,a,y)$ is a probability distribution over $ y \in X$.)
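The $ \varepsilon $-smallness condition is easy to check numerically. Below is a minimal sketch in Python, assuming transition functions are stored as NumPy arrays of shape (states, actions, states); the function names are illustrative, not from the paper.

    import numpy as np

    def l1_distance(P, P_prime):
        # Largest L1 distance between the next-state distributions
        # P(x, a, .) and P'(x, a, .), taken over all pairs (x, a).
        return np.abs(P - P_prime).sum(axis=2).max()

    def is_epsilon_small(P, P_prime, eps):
        # True if the distance of the two transition functions is eps-small.
        return l1_distance(P, P_prime) <= eps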

A tuple $ \langle X, A, \{P_t\}, R \rangle$ is an $ \varepsilon $-stationary MDP [Kalmár et al.(1998)] with $ \varepsilon >0$ if there exists an MDP $ \langle X, A, P, R \rangle$ (called the base MDP) such that the distance of the transition functions $ P$ and $ P_t$ is $ \varepsilon $-small for all $ t=1,2,3,\ldots$.

Perhaps the simplest example of an $ \varepsilon $-MDP is an ordinary MDP whose transition function is perturbed by additive noise at each step, $ P'(x,a,y) = P(x,a,y) + \delta(x,a,y)$, where $ \delta$ is a small noise term chosen so that each $ P'(x,a,.)$ remains a probability distribution and $ \sum_y \vert \delta(x,a,y) \vert \leq \varepsilon $. A more general case will be discussed within the event-learning framework.
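One concrete way to realize such a perturbation, sketched below under our own construction (not the paper's), is to mix the base kernel with a random kernel: with $ P' = (1-\alpha)P + \alpha Q$ we have $ \delta = \alpha(Q-P)$, each $ P'(x,a,.)$ is automatically a distribution, and $ \Vert P(x,a,.) - P'(x,a,.) \Vert _{L_1} \leq 2\alpha$, so $ \alpha = \varepsilon /2$ guarantees $ \varepsilon $-smallness.

    import numpy as np

    def perturb_kernel(P, eps, rng):
        # Return a valid transition kernel within L1 distance eps of P.
        Q = rng.random(P.shape)
        Q /= Q.sum(axis=2, keepdims=True)  # normalize each Q(x, a, .)
        alpha = eps / 2.0                  # ||alpha*(Q - P)||_L1 <= 2*alpha = eps
        return (1.0 - alpha) * P + alpha * Q

Drawing a fresh perturbation at every time step then yields, by construction, an $ \varepsilon $-stationary sequence $ \{P_t\}$ around the base MDP.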

In the ordinary MDP model, the dynamic programming operator $ T$ is used to find the optimal value function $ V^*$. In an $ \varepsilon $-MDP the operator becomes time-dependent; at time step $ t$ it is given by

$\displaystyle [T_tV](x) = \max_a \sum_y P_t(x,a,y) \left( R(x,a,y) + \gamma V(y) \right).$ (4)

Of course, the iteration $ V_{t+1} = T_t V_t$ does not necessarily have a fixed point. The most we can expect to find is a close approximation of the optimal value function of the base MDP, i.e., a $ \hat{V}$ such that $ \Vert\hat{V} - V^*\Vert < \varepsilon '$ for some $ \varepsilon '>0$.
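The iteration itself is straightforward to write down. Here is a minimal sketch of $ V_{t+1} = T_t V_t$ from Equation (4), again with illustrative names, where the time-varying kernels are supplied by a callable such as the perturb_kernel sketch above.

    import numpy as np

    def bellman_operator(P_t, R, V, gamma):
        # [T_t V](x) = max_a sum_y P_t(x, a, y) * (R(x, a, y) + gamma * V(y))
        targets = R + gamma * V[None, None, :]
        return np.einsum('xay,xay->xa', P_t, targets).max(axis=1)

    def iterate_values(kernel_at, R, gamma, n_steps, n_states):
        # V_{t+1} = T_t V_t with a different kernel P_t = kernel_at(t) each step.
        V = np.zeros(n_states)
        for t in range(n_steps):
            V = bellman_operator(kernel_at(t), R, V, gamma)
        return V

Since each $ T_t$ is a $ \gamma$-contraction whose fixed point lies close to $ V^*$ of the base MDP, the iterates do not converge but remain in its neighborhood, consistent with the bound cited below.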

In Section 3.2 we show that such an approximation with $ \varepsilon ' \propto \varepsilon $ can be found (Theorem 3.3).


