In this section we propose an extension of the MDP concept in which transition probabilities are allowed to vary over time. Without restrictions, however, such a model would be too general for useful theorems to be established. We therefore restrict ourselves to cases in which the variation over time remains small.
We say that the distance of two transition functions $P_1$ and $P_2$ is $\varepsilon$-small ($\varepsilon > 0$) if $\| P_1(x,a,\cdot) - P_2(x,a,\cdot) \|_1 \le \varepsilon$ for all $(x,a)$, i.e., $\sum_{y \in X} | P_1(x,a,y) - P_2(x,a,y) | \le \varepsilon$ for all $x \in X$ and $a \in A$, where the subscript $1$ denotes the $L_1$ norm. (Note that for a given state $x$ and action $a$, $P(x,a,\cdot)$ is a probability distribution over $X$.)
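As a sketch of this definition (in Python with NumPy; the array layout and function names are illustrative, not from the paper), the $\varepsilon$-small condition amounts to computing the $L_1$ distance of the next-state distributions for every state-action pair and taking the maximum:

```python
import numpy as np

def l1_distance(P1, P2):
    """Largest L1 distance between the next-state distributions
    P1[x, a, :] and P2[x, a, :] over all state-action pairs (x, a)."""
    return np.max(np.sum(np.abs(P1 - P2), axis=-1))

def is_eps_small(P1, P2, eps):
    """True if the distance of the two transition functions is eps-small."""
    return l1_distance(P1, P2) <= eps

# Tiny example: 2 states, 1 action; P[x, a, y] = Prob(y | x, a).
P = np.array([[[0.5, 0.5]],
              [[0.2, 0.8]]])
Q = np.array([[[0.55, 0.45]],
              [[0.2, 0.8]]])
print(l1_distance(P, Q))        # approximately 0.1
print(is_eps_small(P, Q, 0.2))  # True
```

Here transition functions are stored as arrays of shape (states, actions, states), so the sum over the last axis is exactly the sum over $y$ in the definition.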
A tuple $(X, A, \{P_t\}_{t=1}^{\infty}, R)$ is an $\varepsilon$-stationary MDP ($\varepsilon$-MDP) [Kalmár et al.(1998)] with $\varepsilon > 0$, if there exists an MDP $(X, A, P, R)$ (called the base MDP) such that the difference of the transition functions $P$ and $P_t$ is $\varepsilon$-small for all $t$.
The simplest example of an $\varepsilon$-MDP is possibly an ordinary MDP whose transition function is superimposed with additive noise, $P_t = P + \eta_t$, in each step, where $\eta_t$ is a small amount of noise (so that $P$ and $P_t$ remain $\varepsilon$-close). A more general case will be discussed within the event-learning framework.
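A minimal sketch of this construction, assuming a particular noise model (uniform perturbation followed by clipping and renormalization; the paper only requires the per-step transition functions to stay $\varepsilon$-close to the base MDP):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(P, scale, rng):
    """Return a transition function P_t = P + noise, clipped and
    renormalized so each P_t[x, a, :] stays a probability distribution.
    Illustrative noise model, not the paper's."""
    noise = rng.uniform(-scale, scale, size=P.shape)
    Pt = np.clip(P + noise, 0.0, None)
    return Pt / Pt.sum(axis=-1, keepdims=True)

# Base MDP transition function: 3 states, 2 actions.
P = rng.dirichlet(np.ones(3), size=(3, 2))

# For this noise model, the L1 deviation per (x, a) is bounded by a
# small multiple of scale * n_states; a crude bound suffices here.
eps = 4 * 0.01 * P.shape[-1]
for _ in range(100):
    Pt = perturb(P, scale=0.01, rng=rng)
    assert np.sum(np.abs(Pt - P), axis=-1).max() <= eps
```

Each `Pt` produced this way plays the role of one $P_t$ in the sequence $\{P_t\}_{t=1}^{\infty}$, with $P$ as the base MDP's transition function.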
In the ordinary MDP model, the dynamic programming operator $T$ is used to find the optimal value function $V^*$. Its time-dependent counterpart $T_t$ at time step $t$ is given by
\[
(T_t V)(x) = \max_{a \in A} \sum_{y \in X} P_t(x,a,y) \bigl( R(x,a,y) + \gamma V(y) \bigr).
\]
Of course, the iteration $V_{t+1} = T_t V_t$ does not necessarily have a fixed point. The most that we can expect to find is a close approximation of the optimal value function of the base MDP, i.e., a $V$ such that $\| V - V^* \| \le \kappa$ with some $\kappa > 0$.
In Section 3.2 we show that such an approximation with small $\kappa$ can be found (Theorem 3.3).
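The behavior described above can be illustrated numerically (a sketch under assumed parameters: a small random base MDP, discount $\gamma = 0.9$, and the same illustrative noise model as before; none of these specifics come from the paper). The time-varying iteration $V_{t+1} = T_t V_t$ never settles on a fixed point, yet $V_t$ hovers in a neighborhood of the base MDP's $V^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                 # discount factor (illustrative)
n_states, n_actions = 4, 2

# Base MDP: random transition function P(x,a,y) and reward R(x,a,y).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))

def T(P, V):
    """(T V)(x) = max_a sum_y P(x,a,y) * (R(x,a,y) + gamma * V(y))."""
    return np.max(np.sum(P * (R + gamma * V), axis=-1), axis=-1)

# Optimal value function of the base MDP by ordinary value iteration.
V_star = np.zeros(n_states)
for _ in range(1000):
    V_star = T(P, V_star)

def perturb(P, scale):
    """A transition function eps-close to P (illustrative noise model)."""
    Pt = np.clip(P + rng.uniform(-scale, scale, size=P.shape), 0.0, None)
    return Pt / Pt.sum(axis=-1, keepdims=True)

# Time-varying iteration V_{t+1} = T_t V_t with a fresh P_t each step.
V = np.zeros(n_states)
for t in range(500):
    V = T(perturb(P, scale=0.01), V)

print(np.max(np.abs(V - V_star)))  # stays small for small noise
```

Shrinking the noise scale shrinks the final deviation, matching the intuition that the achievable $\kappa$ depends on $\varepsilon$.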