
Q-learning in Generalized $ \varepsilon $-MDPs

Consider a generalized $ \varepsilon $-MDP. We assume again that $ {\textstyle\bigoplus}_t$ is an expected value operator for all $ t$, i.e., $ ({\textstyle\bigoplus}_t g)(x,a) = \sum_y P_t(x,a,y) g(x,a,y)$. By applying Theorem 3.3, we show that the generalized Q-learning algorithm described by the iteration

$\displaystyle Q_{t+1}(x_t,a_t) = (1-\alpha_t(x_t,a_t))\,Q_t(x_t,a_t) + \alpha_t(x_t,a_t)\bigl(r_t+\gamma\,({\textstyle\bigotimes}Q_t)(\tilde{y}_t)\bigr),$    

where $ \tilde{y}_t$ is selected according to the probability distribution $ P_t(x_t, a_t, .)$, still finds an asymptotically near-optimal value function. To this end, we first prove a lemma.
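Before turning to the lemma, the iteration above can be made concrete with a minimal tabular sketch (our own illustration, not code from the text). It assumes a finite state-action space and that $ {\textstyle\bigotimes}$ acts on the row of Q-values at the sampled successor state; with np.max as the combination operator it reduces to ordinary Q-learning. The names q_update and combine are ours.

    import numpy as np

    def q_update(Q, x, a, r, y_tilde, alpha, gamma, combine=np.max):
        # One step of the generalized Q-learning iteration:
        #   Q       : (n_states, n_actions) array of current estimates Q_t
        #   x, a    : current state and action (x_t, a_t)
        #   r       : observed reward r_t
        #   y_tilde : successor state drawn by the environment from P_t(x, a, .)
        #   combine : the combination operator; np.max gives ordinary Q-learning
        target = r + gamma * combine(Q[y_tilde])
        Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * target
        return Q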

Let us define $ \tilde{T}_t(Q',Q)(x,a)$ in the following fashion:

$\displaystyle \tilde{T}_t(Q',Q)(x,a) = \begin{cases} (1-\alpha_t(x,a))Q'(x,a) + \alpha_t(x,a)\bigl(r_t+\gamma\,({\textstyle\bigotimes}Q)(\tilde{y}_t)\bigr), & \text{if $x=x_t$ and $a=a_t$}, \\ Q'(x,a), & \text{otherwise}, \end{cases}$ (5)

where $ \tilde{y}_t$ is sampled from distribution $ P_t(x_t, a_t, .)$, and $ x_{t+1}=\tilde{y}_t$.

We make the following assumption on the generalized $ \varepsilon $-MDP:

Assumption A.

  1. $ {\textstyle\bigotimes}$ is a non-expansion,
  2. $ {\textstyle\bigotimes}$ does not depend on $ R$ or $ P$,
  3. $ r_t$ has a finite variance and $ E(r_t \vert x_t, a_t) = R(x_t,a_t)$.
These conditions are of a technical nature and are not too restrictive: they may hold for a broad variety of environments. In particular, the maximization operator of ordinary Q-learning satisfies them; see the sketch below.
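The maximization operator is a non-expansion and depends on neither $ R$ nor $ P$. The snippet below is a purely illustrative numerical spot-check of the non-expansion property on randomly generated value functions (the variable names are ours):

    import numpy as np

    # Spot-check: |max_a Q1(y,a) - max_a Q2(y,a)| <= max_a |Q1(y,a) - Q2(y,a)|
    # for every state y, i.e. the max-over-actions operator is a non-expansion.
    rng = np.random.default_rng(1)
    Q1 = rng.normal(size=(5, 3))   # two arbitrary value functions, 5 states x 3 actions
    Q2 = rng.normal(size=(5, 3))
    lhs = np.abs(Q1.max(axis=1) - Q2.max(axis=1))
    rhs = np.abs(Q1 - Q2).max(axis=1)
    assert np.all(lhs <= rhs + 1e-12)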

Lemma 3.4   Let $ M=\max_{x,a} Q^*(x,a) - \min_{x,a} Q^*(x,a)$. If Assumption A holds, then the random operator sequence $ \tilde{T}_t$ $ \kappa$-approximates $ K$ at $ Q^*$, with $ \kappa = \gamma M \varepsilon $.

Proof. Define the auxiliary operator sequence

$\displaystyle T_t(Q',Q)(x,a) = \begin{cases} (1-\alpha_t(x,a))Q'(x,a) + \alpha_t(x,a)\bigl(r_t+\gamma\,({\textstyle\bigotimes}Q)(y_t)\bigr), & \text{if $x=x_t$ and $a=a_t$}, \\ Q'(x,a), & \text{otherwise}, \end{cases}$ (6)

where $ y_t$ is sampled from distribution $ P(x_t, a_t, .)$, and $ x_{t+1}=\tilde{y}_t$. Note that the only difference between $ T_t$ and $ \tilde{T}_t$ is that the successor states ($ y_t$ and $ \tilde{y}_t$) are sampled from different distributions ($ P$ and $ P_t$). By Lemma B.1, we may assume that $ \Pr(y_t = \tilde{y}_t) \ge 1-\varepsilon $ while, at the same time, $ y_t$ is sampled from $ P(x_t, a_t, .)$ and $ \tilde{y}_t$ from $ P_t(x_t, a_t, .)$ ($ y_t$ and $ \tilde{y}_t$ are not independent of each other).

Note also that the definition in Equation 6 differs from that of Equation 5, because in Equation 6 the state $ y_t$ used in the update is not necessarily the actual successor state $ x_{t+1}$ of $ x_t$. Fortunately, the conditions of the Robbins-Monro theorem [Szepesvári(1998)] do not require this, so it remains true that $ T_t$ approximates $ K$ at $ Q^*$ uniformly w.p.1.

Let $ Q_0 = \tilde Q_0$ be an arbitrary value function, and let

$\displaystyle Q_{t+1} = T_t(Q_t,Q^*)$

and

$\displaystyle \tilde Q_{t+1} = \tilde T_t(\tilde Q_t,Q^*).$

Recall that $ \kappa$-approximation means that $ \limsup_{t\to\infty} \Vert\tilde Q_t - Q^*\Vert \le \kappa$ w.p.1. Since $ \Vert\tilde Q_t - Q^*\Vert \le \Vert\tilde Q_t - Q_t\Vert + \Vert Q_t - Q^*\Vert$ and $ \Vert Q_t - Q^*\Vert \to 0$ w.p.1, it suffices to show that $ \limsup_{t\to\infty} \Vert\tilde Q_t - Q_t\Vert \le \kappa$ w.p.1.

Clearly, for any $ (x,a)$,

$\displaystyle \vert\tilde Q_{t+1}(x,a) - Q_{t+1}(x,a)\vert$ $\displaystyle \le$ $\displaystyle \max \biggl( \max_{(x,a)\neq (x_t,a_t)} \bigl\vert\tilde Q_t(x,a) - Q_t(x,a)\bigr\vert,$  
    $\displaystyle (1-\alpha_t) \bigl\vert\tilde Q_t(x_t,a_t) - Q_t(x_t,a_t)\bigr\vert + \gamma \alpha_t \bigl\vert ({\textstyle\bigotimes}Q^*)(\tilde y_t) - ({\textstyle\bigotimes}Q^*)(y_t) \bigr\vert \biggr)$  
  $\displaystyle \le$ $\displaystyle \max \Bigl( \Vert\tilde Q_{t} - Q_{t}\Vert,\; (1-\alpha_t) \Vert\tilde Q_t - Q_t\Vert + \gamma \alpha_t \Vert Q^*(\tilde y_t, .) - Q^*(y_t, .) \Vert \Bigr),$  

where we used the shorthand $ \alpha_t$ for $ \alpha_t(x_t,a_t)$; the second inequality uses the non-expansion property of $ {\textstyle\bigotimes}$ (Assumption A.1). Since this bound holds for every $ \vert\tilde Q_{t+1}(x,a) - Q_{t+1}(x,a)\vert$, it also holds for their maximum, $ \Vert\tilde Q_{t+1} - Q_{t+1}\Vert$. As mentioned before, $ \Pr(\tilde y_t = y_t)\ge 1-\varepsilon $. If $ \tilde{y}_t = y_t$, then $ \Vert Q^*(\tilde y_t, .) - Q^*(y_t,.)\Vert = 0$; otherwise it can only be bounded by $ M$, the range of $ Q^*$. Therefore the following random sequence bounds $ \Vert\tilde Q_{t} - Q_{t}\Vert$ from above: $ a_0 := 0$,

$\displaystyle a_{t+1} := (1-\alpha_t) a_t + \alpha_t h_t,$ (7)

where

$\displaystyle h_t =
\begin{cases}
\gamma M & \text{with probability $\varepsilon $}, \\
0 & \text{with probability $1-\varepsilon $}.
\end{cases}$

It is easy to see that Equation 7 is a Robbins-Monro-type averaging iteration [Robbins and Monro(1951),Jaakkola et al.(1994)]. Therefore $ a_t$ converges to $ E(h_t) = \gamma M \varepsilon $ w.p.1.
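This limit can also be checked empirically. The short simulation below uses arbitrary, illustrative values for $ \gamma $, $ M$ and $ \varepsilon $ (they are not taken from the text) and step sizes $ \alpha_t = 1/t$:

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, M, eps = 0.9, 1.0, 0.1        # illustrative values only
    a = 0.0
    for t in range(1, 200001):
        alpha = 1.0 / t                  # Robbins-Monro step sizes
        h = gamma * M if rng.random() < eps else 0.0
        a = (1.0 - alpha) * a + alpha * h
    print(a, gamma * M * eps)            # a_t settles near E(h_t) = gamma*M*eps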

Consequently, $ \limsup_{t\to\infty} \Vert\tilde Q_{t} - Q_{t}\Vert \le \gamma M \varepsilon $, which completes the proof. $ \qedsymbol$

Theorem 3.5   Let $ Q^*$ be the optimal value function of the base MDP of the generalized $ \varepsilon $-MDP, and let $ M=\max_{x,a} Q^*(x,a) - \min_{x,a} Q^*(x,a)$. If Assumption A holds, then $ \limsup_{t\to\infty} \Vert Q_t - Q^*\Vert \le \frac{2}{1-\gamma}\gamma M \varepsilon $ w.p.1, i.e., the sequence $ Q_t$ $ \kappa'$-approximates the optimal value function with $ \kappa' = \frac{2}{1-\gamma}\gamma M \varepsilon $.

Proof. By Lemma 3.4, the operator sequence $ \tilde{T}_t$ $ \kappa$-approximates $ K$ (defined in Equation 2) at $ Q^*$ with $ \kappa = \gamma M \varepsilon $. The proof can be finished analogously to the proof of Corollary A.3: with the substitution $ X \leftarrow X \times A$, and defining $ G_t$ and $ F_t$ as in Equations 11 and 12, the conditions of Theorem 3.3 hold, thus proving our statement. $ \qedsymbol$
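To get a feeling for the magnitude of the bound, the following quick evaluation (with arbitrary, illustrative parameter values) shows how $ \kappa' = \frac{2}{1-\gamma}\gamma M \varepsilon $ grows as the discount factor approaches 1:

    # kappa' = 2*gamma*M*eps / (1 - gamma) for a few discount factors
    M, eps = 1.0, 0.01
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, 2 * gamma * M * eps / (1 - gamma))
    # prints approximately 0.02, 0.18, and 1.98, respectively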

