Generalized $\varepsilon$-MDPs

As with regular MDPs, $\varepsilon$-stationary MDPs can also be generalized by allowing general environment and agent operators. The resulting model inherits the advantages of both generalizations: a broad range of decision problems can be treated in a single framework, while the underlying environment is allowed to change over time. This family of MDPs will be called generalized $\varepsilon$-stationary MDPs, or $\varepsilon$-MDPs for short.

Given a prescribed $\varepsilon > 0$, a generalized $\varepsilon$-MDP is defined by the tuple $\langle X, A, R, \{{\textstyle\bigoplus}_t\}, \{{\textstyle\bigotimes}_t\} \rangle$, with ${\textstyle\bigoplus}_t: (X \times A \times X \rightarrow \mathbb{R}) \rightarrow (X \times A \rightarrow \mathbb{R})$ and ${\textstyle\bigotimes}_t: (X \times A \rightarrow \mathbb{R}) \rightarrow (X \rightarrow \mathbb{R})$, $t=1,2,3,\ldots$, if there exists a generalized MDP $\langle X, A, R, {\textstyle\bigoplus}, {\textstyle\bigotimes}\rangle$ such that $\limsup_{t\to\infty} \left\Vert{\textstyle\bigotimes}_t {\textstyle\bigoplus}_t - {\textstyle\bigotimes}{\textstyle\bigoplus}\right\Vert \le \varepsilon$. Note that this last condition requires the dynamic-programming operator sequence induced by ${\textstyle\bigotimes}_t {\textstyle\bigoplus}_t$ to stay asymptotically close to the dynamic-programming operator induced by ${\textstyle\bigotimes}{\textstyle\bigoplus}$.

Note also that this definition is indeed a generalization of both concepts: setting $\varepsilon = 0$, ${\textstyle\bigoplus}_t = {\textstyle\bigoplus}$ and ${\textstyle\bigotimes}_t = {\textstyle\bigotimes}$ for all $t$ yields a generalized MDP, while setting $({\textstyle\bigoplus}_t S) (x,a) = \sum_y P_t(x,a,y) S(x,a,y)$ and $({\textstyle\bigotimes}_t Q) (x) = \max_a Q(x,a)$ for all $t$ yields an $\varepsilon$-stationary MDP.
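The second special case can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the two-state, two-action kernels `P` and `P_t` are made-up numbers, the helper names (`oplus`, `otimes`, `sup_dist`, `l1_gap`) are hypothetical, and the operator distance is read as the largest sup-norm gap over test functions $S$ with $\Vert S \Vert_\infty \le 1$, which is one natural interpretation of the norm in the definition rather than necessarily the one intended. Under that reading, the expected-value and max operators satisfy $\Vert {\textstyle\bigotimes}_t {\textstyle\bigoplus}_t S - {\textstyle\bigotimes}{\textstyle\bigoplus} S \Vert_\infty \le \max_{x,a} \sum_y |P_t(x,a,y) - P(x,a,y)|$, and the code compares a sampled estimate of the left-hand side with this bound.

```python
import random


def oplus(P, S):
    """Expected-value operator: (oplus S)(x, a) = sum_y P[x][a][y] * S[x][a][y]."""
    return [[sum(p * s for p, s in zip(P[x][a], S[x][a]))
             for a in range(len(P[x]))]
            for x in range(len(P))]


def otimes(Q):
    """Max operator: (otimes Q)(x) = max_a Q[x][a]."""
    return [max(row) for row in Q]


def sup_dist(P_t, P, S):
    """Sup-norm gap between otimes(oplus_t S) and otimes(oplus S) for one S."""
    u = otimes(oplus(P_t, S))
    v = otimes(oplus(P, S))
    return max(abs(a - b) for a, b in zip(u, v))


def l1_gap(P_t, P):
    """Largest L1 distance between matching transition rows of P_t and P."""
    return max(sum(abs(p - q) for p, q in zip(P_t[x][a], P[x][a]))
               for x in range(len(P)) for a in range(len(P[x])))


# Hypothetical base kernel P[x][a][y] (2 states, 2 actions).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]

# Hypothetical perturbed kernel at some time step t.
P_t = [[[0.85, 0.15], [0.25, 0.75]],
       [[0.5, 0.5], [0.05, 0.95]]]

# Sample test functions S with |S| <= 1 and record the largest observed gap.
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    S = [[[rng.uniform(-1.0, 1.0) for _ in range(2)]
          for _ in range(2)]
         for _ in range(2)]
    worst = max(worst, sup_dist(P_t, P, S))

bound = l1_gap(P_t, P)
print(f"sampled gap: {worst:.4f}, L1 bound: {bound:.4f}")
```

If the perturbed kernels $P_t$ stay within such an $L_1$ distance $\varepsilon$ of $P$ for all large $t$, the $\limsup$ condition of the definition holds under the norm assumed here. The sampled gap is only a lower estimate of the true operator distance, so it shows consistency with the bound, not tightness.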