Q-learning in Generalized $\varepsilon$-MDPs

Proof. Define the auxiliary operator sequence

$\displaystyle T_t(Q',Q)(x,a) = \begin{cases}(1-\alpha_t(x,a))Q'(x,a) + \alpha_t... ...), & \text{if $x=x_t$\ and $a=a_t$}, \\ Q'(x,a) & \text{otherwise}, \end{cases}$

(6)

where $y_t$ is sampled from its own transition distribution, and $x_{t+1}=\tilde{y}_t$. Note that the only difference between $T_t$ and $\tilde{T}_t$ is that the successor states $y_t$ and $\tilde{y}_t$ are sampled from different distributions. By Lemma B.1, we can assume that $\Pr(y_t = \tilde{y}_t) \ge 1-\varepsilon$ while each of $y_t$ and $\tilde{y}_t$ still follows its own distribution; the two samples are coupled, so they are not independent of each other.
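The coupling invoked through Lemma B.1 can be illustrated with a standard maximal-coupling construction. The sketch below is not taken from the lemma itself. It assumes a finite state set, that the two successor distributions are given as probability lists `p` and `q` (hypothetical names), and that their total variation distance is at most $\varepsilon$; under these assumptions the two samples keep their own marginals and agree with probability $1 - \mathrm{TV}(p,q) \ge 1-\varepsilon$.

```python
import random


def sample_coupled(p, q, rng):
    """Draw (y, y_tilde) with y ~ p and y_tilde ~ q, agreeing as often as possible.

    p and q are probability lists over states 0..n-1 (illustrative inputs).
    """
    states = list(range(len(p)))
    overlap = [min(pi, qi) for pi, qi in zip(p, q)]
    w = sum(overlap)  # equals 1 - TV(p, q)

    if rng.random() < w:
        # Shared part: both samples take the same value.
        y = rng.choices(states, weights=overlap)[0]
        return y, y

    # Residual parts: drawn separately; their supports are disjoint,
    # so the two samples differ in this branch.
    res_p = [pi - oi for pi, oi in zip(p, overlap)]
    res_q = [qi - oi for qi, oi in zip(q, overlap)]
    y = rng.choices(states, weights=res_p)[0]
    y_tilde = rng.choices(states, weights=res_q)[0]
    return y, y_tilde


def agreement_rate(p, q, n, seed=0):
    """Empirical estimate of Pr(y == y_tilde) over n coupled draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y, y_tilde = sample_coupled(p, q, rng)
        hits += (y == y_tilde)
    return hits / n
```

For example, with `p = [0.5, 0.3, 0.2]` and `q = [0.45, 0.3, 0.25]` the total variation distance is 0.05, so `agreement_rate(p, q, 20000)` should come out close to 0.95.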

Note also that the definition in Equation 6 differs from that of Equation 5, because $x_{t+1}$ need not equal the sampled successor state $y_t$. Fortunately, the conditions of the Robbins-Monro theorem [Szepesvári(1998)] do not require this, so it remains true that $T_t$ approximates the corresponding dynamic-programming operator at $Q^*$ uniformly w.p.1.

Let $Q_0 = \tilde Q_0$ be an arbitrary value function, and let

$\displaystyle Q_{t+1} = T_t(Q_t,Q^*)$

and

$\displaystyle \tilde Q_{t+1} = \tilde T_t(\tilde Q_t,Q^*)$

Recall that $\kappa$-approximation means that $\limsup_{t\to\infty} \Vert\tilde Q_t - Q^*\Vert \le \kappa$ w.p.1. Since $\Vert\tilde Q_t - Q^*\Vert \le \Vert\tilde Q_t - Q_t\Vert + \Vert Q_t - Q^*\Vert$ and $\Vert Q_t - Q^*\Vert \to 0$ w.p.1, it suffices to show that $\limsup_{t\to\infty} \Vert\tilde Q_t - Q_t\Vert \le \kappa$ w.p.1.

Clearly, for any $(x,a)$,

$\displaystyle \begin{aligned} \vert\tilde Q_{t+1}(x,a) - Q_{t+1}(x,a)\vert &\le \max \biggl( \max_{(x,a)\neq (x_t,a_t)} \bigl\vert\tilde Q_t(x,a) - Q_t(x,a)\bigr\vert, \\ &\qquad\quad (1-\alpha_t) \bigl\vert\tilde Q_t(x_t,a_t) - Q_t(x_t,a_t)\bigr\vert + \gamma \alpha_t \bigl\vert {\textstyle\bigotimes}Q^*(\tilde y_t) - {\textstyle\bigotimes}Q^*(y_t) \bigr\vert \biggr) \\ &\le \max \biggl( \Vert\tilde Q_{t} - Q_{t}\Vert, \\ &\qquad\quad (1-\alpha_t) \Vert\tilde Q_t - Q_t\Vert + \gamma \alpha_t \Vert Q^*(\tilde y_t, .) - Q^*(y_t,.) \Vert \biggr), \end{aligned}$

where we used the shorthand $\alpha_t$ for $\alpha_t(x_t,a_t)$. Since this bound holds for every pair $(x,a)$, it also holds for the maximum, $\Vert\tilde Q_{t+1} - Q_{t+1}\Vert$. As mentioned before, $\Pr(\tilde y_t = y_t)\ge 1-\varepsilon$. If $\tilde{y}_t = y_t$, then $\Vert Q^*(\tilde y_t, .) - Q^*(y_t,.)\Vert = 0$. Otherwise the only applicable upper bound is $\Vert Q^*(\tilde y_t, .) - Q^*(y_t,.)\Vert \le M$. Therefore the following random sequence bounds $\Vert\tilde Q_{t} - Q_{t}\Vert$ from above:

$\displaystyle a_{t+1} := (1-\alpha_t) a_t + \alpha_t h_t,$

(7)

where

$\displaystyle h_t = \begin{cases} \gamma M & \text{with probability $\varepsilon $}, \\ 0 & \text{with probability $1-\varepsilon $}. \end{cases}$

It is easy to see that Equation 7 is a Robbins-Monro-like iterated averaging [Robbins and Monro(1951), Jaakkola et al.(1994)]. Therefore $a_t$ converges to $E(h_t) = \gamma M \varepsilon$ w.p.1.
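The averaging in Equation 7 can be checked numerically. The following is a minimal sketch, not part of the proof: it assumes the step size $\alpha_t = 1/(t+1)$ (one choice satisfying the usual Robbins-Monro conditions, under which the iterate is exactly the running mean of the $h_t$), and the function name and parameter values are hypothetical.

```python
import random


def simulate_bound_sequence(gamma, M, eps, steps, seed=0):
    """Iterate a_{t+1} = (1 - alpha_t) a_t + alpha_t h_t from Equation 7.

    Assumes alpha_t = 1 / (t + 1); h_t equals gamma * M with probability eps
    and 0 otherwise, drawn independently at each step.
    """
    rng = random.Random(seed)
    a = 1.0  # arbitrary starting value; it is overwritten at t = 0 since alpha_0 = 1
    for t in range(steps):
        alpha = 1.0 / (t + 1)
        h = gamma * M if rng.random() < eps else 0.0
        a = (1.0 - alpha) * a + alpha * h
    return a


if __name__ == "__main__":
    gamma, M, eps = 0.9, 2.0, 0.1  # illustrative values only
    print("simulated limit:", simulate_bound_sequence(gamma, M, eps, 200_000))
    print("gamma * M * eps:", gamma * M * eps)
```

With these illustrative values the simulated iterate should settle near $\gamma M \varepsilon = 0.18$, in line with the limit used in the proof.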

Consequently, $\limsup_{t\to\infty} \Vert\tilde Q_{t} - Q_{t}\Vert \le \gamma M \varepsilon$ w.p.1, which completes the proof. $\qedsymbol$