

Asymptotic Boundedness of Value Iteration

In this section we prove a generalized form of the convergence theorem of Szepesvári and Littman (for the original theorem, see Appendix A). We do not require probability one uniform convergence of the approximating operators, only a sufficiently close approximation. Therefore, the theorem can be applied to prove results about algorithms in generalized $ \varepsilon $-MDPs. Our definitions of closeness, both for value functions and for dynamic-programming operators, are given below.

Let $ X$ be an arbitrary state space and denote by $ \mathbf{B}(X)$ the set of value functions. Let $ T: \mathbf{B}(X) \to \mathbf{B}(X)$ be an arbitrary contraction mapping with unique fixed point $ V^*$, and let $ T_t: \mathbf{B}(X) \times \mathbf{B}(X) \to \mathbf{B}(X)$ be a sequence of random operators.

Definition 3.1   A sequence of value functions $ V_t$ $ \kappa$-approximates $ V$ with $ \kappa>0$ if $ \limsup_{t\rightarrow\infty} \left\Vert V_t-V\right\Vert \le \kappa$ with probability one.

Definition 3.2   We say that $ T_t$ $ \kappa$-approximates $ T$ at $ V$ over $ X$ if, for any initial value function $ V_0$ and the sequence defined by $ V_{t+1}=T_t(V_t,V)$, $ V_t$ $ \kappa$-approximates $ TV$ over $ X$ with probability one.

Note that $ T_t$ may depend on the approximated value function $ V$, unlike the previous example in Equation 4. $ \kappa$-approximation of value functions is, indeed, weaker (more general) than probability one uniform convergence: the latter means that for all $ \varepsilon , \delta>0$ there exists a $ t_T$ such that

$\displaystyle \Pr( \sup_{t\ge t_T}(\left\Vert V_t-V\right\Vert )<\delta ) > 1-\varepsilon ,$

whereas an equivalent form of $ \kappa$-approximation is that for all $ \varepsilon , \delta>0$ there exists a $ t_T$ such that

$\displaystyle \Pr( \sup_{t\ge t_T}(\left\Vert V_t-V\right\Vert )< \kappa + \delta ) > 1-\varepsilon ,$

where $ \kappa$ is fixed.
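
For instance (an illustration not in the original text), a sequence that merely oscillates inside a $ \kappa$-band around $ V$ already $ \kappa$-approximates $ V$ without converging to it: with $ \mathbf{1}$ denoting the constant 1 value function,

$\displaystyle V_t = V + (-1)^t\,\tfrac{\kappa}{2}\,\mathbf{1} \qquad\textrm{gives}\qquad \left\Vert V_t-V\right\Vert = \tfrac{\kappa}{2} \;\textrm{ for all } t,$

so $ \limsup_{t\to\infty}\left\Vert V_t-V\right\Vert = \kappa/2 \le \kappa$ with probability one, while $ V_t$ has no limit.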

Theorem 3.3   Let $ T$ be an arbitrary mapping with fixed point $ V^*$, and let $ T_t$ $ \kappa$-approximate $ T$ at $ V^*$ over $ X$. Let $ V_0$ be an arbitrary value function, and define $ V_{t+1} = T_t(V_t, V_t)$. If there exist functions $ 0 \leq F_t(x) \leq 1$ and $ 0 \leq G_t(x) \leq 1$ satisfying the conditions below with probability one:
  1. for all $ U_1, U_2 \in \mathbf{B}(X)$ and all $ x\in X$,

     $\displaystyle \big\vert T_t(U_1,V^*)(x)-T_t(U_2,V^*)(x) \big\vert \leq G_t(x) \big\vert U_1(x) - U_2(x) \big\vert;$

  2. for all $ U, V \in \mathbf{B}(X)$ and all $ x\in X$,

     $\displaystyle \big\vert T_t(U,V^*)(x)-T_t(U,V)(x) \big\vert \leq F_t(x) \sup_{x'} \big\vert V^*(x') - V(x') \big\vert;$

  3. for all $ k > 0$, $ \prod_{t=k}^n G_t(x)$ converges to zero uniformly in $ x$ as $ n$ increases; and,
  4. there exists $ 0 \leq \gamma < 1$ such that for all $ x\in X$ and sufficiently large $ t$,

    $\displaystyle F_t(x)\leq \gamma(1-G_t(x)) \textrm{ w.p.1.}$

then $ V_t$ $ \kappa'$-approximates $ V^*$ over $ X$, where $ \kappa' = \frac{2}{1-\gamma}\, \kappa$.
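
As a concrete illustration (not part of the original theorem statement), a standard family of operators satisfying these conditions is the relaxed update

$\displaystyle T_t(U,V)(x) = \bigl(1-\alpha_t(x)\bigr)\,U(x) + \alpha_t(x)\,(TV)(x),$

where $ T$ is a $ \gamma$-contraction in the supremum norm and $ 0 \le \alpha_t(x) \le 1$ are step sizes. Conditions 1, 2 and 4 then hold with $ G_t(x) = 1-\alpha_t(x)$ and $ F_t(x) = \gamma\,\alpha_t(x)$, while condition 3 is a step-size requirement; it holds, for example, if $ \alpha_t$ does not depend on $ x$ and $ \sum_t \alpha_t = \infty$. When $ (TV)(x)$ is only available through a noisy or biased estimate, the $ \kappa$-approximation hypothesis captures the residual error; updates of this flavor reappear in the Q-learning analysis of the next section.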

Proof. The proof is similar to that of the original theorem. First we define the sequence of auxiliary functions $ U_t$ by the recursion $ U_0 = V_0$, $ U_{t+1} = T_t (U_t, V^*)$. Since $ T_t$ $ \kappa$-approximates $ T$ at $ V^*$ and $ TV^* = V^*$, we have $ \limsup_{t\to\infty} \Vert U_t - V^* \Vert \le \kappa$ with probability one, i.e., there exists a $ t_0$ such that $ \Vert U_t - V^* \Vert \le \kappa$ for all $ t \ge t_0$. Let

$\displaystyle \delta_t(x) = \vert U_{t}(x) - V_t(x)\vert.$

For $ t \ge t_0$ we have

\begin{align*}
\delta_{t+1}(x) &= \vert U_{t+1}(x) - V_{t+1}(x)\vert \\
 &= \vert T_t(U_t,V^*)(x) - T_t(V_t,V_t)(x)\vert \\
 &\le \vert T_t(U_t,V^*)(x) - T_t(V_t,V^*)(x)\vert + \vert T_t(V_t,V^*)(x) - T_t(V_t,V_t)(x)\vert \\
 &\le G_t(x)\, \vert U_t(x)-V_t(x)\vert + F_t(x) \left\Vert V^*-V_t\right\Vert \\
 &\le G_t(x)\, \vert U_t(x)-V_t(x)\vert + F_t(x) \bigl(\left\Vert V^*-U_t\right\Vert + \left\Vert U_t-V_t\right\Vert \bigr) \\
 &\le G_t(x)\, \delta_t(x) + F_t(x) \bigl(\kappa + \left\Vert\delta_t\right\Vert \bigr).
\end{align*}

Then, by Lemma C.1 (found in the appendix), we get that $ \limsup_{t\to\infty} \Vert \delta_t \Vert \le \frac{1+\gamma}{1-\gamma}\, \kappa$. From this, $ \limsup_{t\to\infty} \Vert V_t - V^* \Vert \le \frac{1+\gamma}{1-\gamma}\, \kappa + \kappa = \frac{2}{1-\gamma}\, \kappa$. $ \qedsymbol$
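
To make the bound tangible, the following is a minimal numerical sketch (not from the paper; the random MDP, the noise model, and all names are illustrative assumptions). It runs value iteration in which every application of the Bellman operator is corrupted by a bounded perturbation, $ V_{t+1} = TV_t + \eta_t$ with $ \Vert\eta_t\Vert \le \kappa$. This corresponds to $ T_t(U,V) = TV + \eta_t$, i.e., $ G_t(x)=0$ and $ F_t(x)=\gamma$ in Theorem 3.3, so the tail error should stay within $ 2\kappa/(1-\gamma)$.

# A minimal numerical sketch of Theorem 3.3 (illustrative, not from the paper):
# value iteration where each Bellman backup is corrupted by a bounded
# perturbation eta_t with ||eta_t|| <= kappa. This instantiates the theorem
# with T_t(U, V) = T V + eta_t, G_t(x) = 0, F_t(x) = gamma.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, kappa = 5, 3, 0.9, 0.05

# Small random MDP (an assumption for the sketch): P[a, x, y] = P(y | x, a).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))          # rewards R(x, a) in [0, 1]

def bellman(V):
    """Optimal Bellman operator: (TV)(x) = max_a [R(x,a) + gamma * E_y V(y)]."""
    return np.max(R + gamma * np.einsum('axy,y->xa', P, V), axis=1)

# Fixed point V* of T, computed by exact (noise-free) value iteration.
V_star = np.zeros(n_states)
for _ in range(1000):
    V_star = bellman(V_star)

# Perturbed value iteration: the operator is only kappa-accurate at each step.
V = np.zeros(n_states)
tail_errors = []
for t in range(500):
    eta = rng.uniform(-kappa, kappa, size=n_states)   # ||eta_t|| <= kappa
    V = bellman(V) + eta
    if t >= 400:                                      # inspect the tail only
        tail_errors.append(np.max(np.abs(V - V_star)))

print(f"max tail error : {max(tail_errors):.4f}")
print(f"theorem bound  : {2 * kappa / (1 - gamma):.4f}")   # = 1.0 here

For this simple instance a direct contraction argument gives the sharper bound $ \kappa/(1-\gamma)$, so the observed tail error sits well inside the theorem's guarantee.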

