
Generalized $ \varepsilon $-MDPs

As with regular MDPs, $ \varepsilon $-stationary MDPs can also be generalized by using general environment and agent operators. The resulting model inherits the advantages of both generalizations: a broad range of decision problems can be treated simultaneously, while the underlying environment is also allowed to change over time. This family of MDPs will be called generalized $ \varepsilon $-stationary MDPs, or $ \varepsilon $-MDPs for short.

Given a prescribed $ \varepsilon >0$, a generalized $ \varepsilon $-MDP is defined by the tuple $ \langle X, A, R, \{{\textstyle\bigoplus}_t\}, \{{\textstyle\bigotimes}_t\}
\rangle$, with $ {\textstyle\bigoplus}_t: (X \times A \times X \rightarrow
\mathbb{R}) \rightarrow (X \times A \rightarrow \mathbb{R})$ and $ {\textstyle\bigotimes}_t:
(X \times A \rightarrow \mathbb{R}) \rightarrow (X \rightarrow \mathbb{R})$, $ t=1,2,3,\ldots$, if there exists a generalized MDP $ \langle X, A, R,
{\textstyle\bigoplus}, {\textstyle\bigotimes}\rangle$ such that $ \limsup_{t\to\infty}
\left\Vert{\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus}\right\Vert \le \varepsilon $. Note that this last condition requires the corresponding sequence of dynamic-programming operators $ T_t$ to be asymptotically close to $ T$.
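
The link between the operator condition and the dynamic-programming operators can be made explicit with a short sketch. It rests on two assumptions that are not stated in this section: that the dynamic-programming operators have the usual generalized-MDP form $ T_t V = {\textstyle\bigotimes}_t {\textstyle\bigoplus}_t (R + \gamma V)$ and $ T V = {\textstyle\bigotimes}{\textstyle\bigoplus}(R + \gamma V)$ with a discount factor $ \gamma $, and that the operator norm satisfies $ \Vert ({\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus}) S \Vert \le \Vert {\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus} \Vert \cdot \Vert S \Vert $ for every $ S$. Under these assumptions, for any value function $ V$,

$$ \left\Vert T_t V - T V \right\Vert = \left\Vert ({\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus})(R + \gamma V) \right\Vert \le \left\Vert {\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus} \right\Vert \cdot \left\Vert R + \gamma V \right\Vert , $$

so in the limit the gap between $ T_t V$ and $ T V$ is at most $ \varepsilon \left\Vert R + \gamma V \right\Vert $.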

Note also that the given definition is indeed a generalization of both concepts: setting $ \varepsilon = 0$, $ {\textstyle\bigoplus}_t = {\textstyle\bigoplus}$ and $ {\textstyle\bigotimes}_t = {\textstyle\bigotimes}$ for all $ t$ yields a generalized MDP, while setting $ ({\textstyle\bigoplus}_t S) (x,a) = \sum_y P_t(x,a,y) S(x,a,y)$ and $ ({\textstyle\bigotimes}_t Q) (x) = \max_a Q(x,a)$ for all $ t$ yields an $ \varepsilon $-stationary MDP.
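
The second special case can be illustrated numerically. The following is a minimal sketch, not a definitive implementation: the 2-state, 2-action transition table `P`, the helper `perturbed`, and the perturbation size `0.05` are all hypothetical, and the operator distance is assumed to be measured as the supremum of $ \Vert {\textstyle\bigotimes}_t {\textstyle\bigoplus}_t S - {\textstyle\bigotimes}{\textstyle\bigoplus} S \Vert_\infty $ over functions $ S$ with $ \Vert S \Vert_\infty \le 1$. Under that assumption, the distance is bounded by the largest $ L_1$ difference between the rows of $ P_t$ and $ P$, which the code checks against random samples of $ S$.

```python
import random

def oplus(P, S):
    """Expected-value environment operator: (oplus S)(x,a) = sum_y P[x][a][y] * S[x][a][y]."""
    return [[sum(p * s for p, s in zip(P[x][a], S[x][a]))
             for a in range(len(P[x]))]
            for x in range(len(P))]

def otimes(Q):
    """Greedy agent operator: (otimes Q)(x) = max_a Q[x][a]."""
    return [max(row) for row in Q]

def l1_bound(P_t, P):
    """max over (x,a) of sum_y |P_t - P|; bounds the operator gap for |S| <= 1."""
    return max(sum(abs(p - q) for p, q in zip(P_t[x][a], P[x][a]))
               for x in range(len(P))
               for a in range(len(P[x])))

def sup_dist(u, v):
    """Sup-norm distance between two functions on the state space."""
    return max(abs(a - b) for a, b in zip(u, v))

def perturbed(P, delta):
    """Hypothetical drift: move probability mass delta from successor 0 to successor 1."""
    return [[[p[0] - delta, p[1] + delta] for p in row] for row in P]

# Hypothetical base environment: P[x][a][y] with 2 states and 2 actions.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]
P_t = perturbed(P, 0.05)

bound = l1_bound(P_t, P)

# Sample random S with sup-norm at most 1 and record the largest observed gap.
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    S = [[[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
         for _ in range(2)]
    gap = sup_dist(otimes(oplus(P_t, S)), otimes(oplus(P, S)))
    worst = max(worst, gap)

print("L1 bound on operator gap:", bound)
print("largest sampled gap:", worst)
```

If every $ P_t$ in the sequence stays within such an $ L_1$ bound of $ P$ in the limit, and that bound is at most $ \varepsilon $, the condition in the definition holds under the norm assumed here.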