where the successor state is selected according to the probability distribution $P_t$, still finds an asymptotically near-optimal value function. To this end, we first prove a lemma:

Let us define the auxiliary sequence $Q'_t$ in the following fashion:

$$Q'_{t+1}(x,a) = \bigl(1-\alpha_t(x,a)\bigr)\,Q'_t(x,a) + \alpha_t(x,a)\bigl(g_t(x,a) + \gamma\,(\otimes Q'_t)(y'_t)\bigr), \tag{6}$$

where $y'_t$ is sampled from the distribution $P$, and $Q'_0 = Q_0$.

We make the following assumption on the generalized $\varepsilon$-MDP:

**Assumption A.**

- $\otimes$ is a non-expansion,
- $\alpha_t(x,a)$ does not depend on $g_t$ or $y'_t$,
- $g_t$ has a finite variance, and its expected value is the immediate reward of the pair $(x,a)$.
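As a concrete illustration of the update scheme covered by Assumption A (a hypothetical sketch, not the paper's code; all names are our own), the following instantiates the non-expansion $\otimes$ as a maximum over actions, as in ordinary Q-learning:

```python
# A minimal sketch (our own, not the paper's code) of one generalized
# Q-learning step
#     Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (g + gamma * (⊗Q)(y)),
# with the non-expansion operator ⊗ instantiated as a max over actions.

def otimes(Q, y, actions):
    """(⊗Q)(y): here the greedy backup max_a Q(y, a), a non-expansion."""
    return max(Q[(y, a)] for a in actions)

def q_update(Q, x, a, g, y, alpha, gamma, actions):
    """One asynchronous update of the entry (x, a); all other entries are
    untouched (their effective step size is zero, as Assumption A permits)."""
    target = g + gamma * otimes(Q, y, actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target
    return Q

# Usage on a toy two-state, two-action table initialized to zero.
states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
q_update(Q, x=0, a=1, g=1.0, y=1, alpha=0.5, gamma=0.9, actions=actions)
```

With the table initialized to zero, the updated entry becomes $0.5\cdot 0 + 0.5\cdot(1 + 0.9\cdot 0) = 0.5$ while all other entries stay at zero.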

Here $y_t$ (in Equation 5) is sampled from the distribution $P_t$, and $y'_t$ (in Equation 6) from $P$. Note that the only difference between $Q_t$ and $Q'_t$ is that the successor states ($y_t$ and $y'_t$) are sampled from different distributions ($P_t$ and $P$). By Lemma B.1, we can assume that $\Pr(y_t \neq y'_t) \le \varepsilon$ while, at the same time, $y_t$ is sampled from $P_t$ and $y'_t$ from $P$ ($y_t$ and $y'_t$ are not independent of each other).
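The joint sampling invoked here is what a maximal coupling provides: two random variables drawn together so that they disagree with probability equal to the total-variation distance of their laws, while each retains its correct marginal. A sketch under our own names and discrete-distribution representation (a dict of outcome probabilities), not taken from the paper:

```python
import random

def maximal_coupling(p1, p2, rng=random):
    """Draw (y1, y2) with y1 ~ p1, y2 ~ p2 and
    Pr[y1 != y2] = TV(p1, p2), the total-variation distance.
    p1, p2: dicts mapping outcomes to probabilities (our own representation)."""
    support = set(p1) | set(p2)
    overlap = {k: min(p1.get(k, 0.0), p2.get(k, 0.0)) for k in support}
    m = sum(overlap.values())  # m = 1 - TV(p1, p2)
    if rng.random() < m:
        y = _sample({k: v / m for k, v in overlap.items()}, rng)
        return y, y  # with probability m the two samples coincide
    # Otherwise draw each coordinate from its normalized residual distribution.
    r1 = {k: (p1.get(k, 0.0) - overlap[k]) / (1.0 - m) for k in support}
    r2 = {k: (p2.get(k, 0.0) - overlap[k]) / (1.0 - m) for k in support}
    return _sample(r1, rng), _sample(r2, rng)

def _sample(dist, rng):
    """Inverse-CDF sampling from a dict of outcome probabilities."""
    u, acc = rng.random(), 0.0
    for k, v in dist.items():
        acc += v
        if u < acc:
            return k
    return k  # guard against floating-point round-off

# Usage: distributions at TV distance 0.1; the coupled samples disagree
# roughly 10% of the time while the first marginal keeps Pr[y1 = "a"] = 0.5.
random.seed(0)
pairs = [maximal_coupling({"a": 0.5, "b": 0.5}, {"a": 0.4, "b": 0.6})
         for _ in range(10000)]
mismatch = sum(y1 != y2 for y1, y2 in pairs) / len(pairs)
frac_a = sum(y1 == "a" for y1, _ in pairs) / len(pairs)
```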

Note also that the definition in Equation 6 differs from that of Equation 5, because $y'_t$ is not necessarily the successor state of $x_t$. Fortunately, the conditions of the Robbins-Monro theorem [Szepesvári(1998)] do not require this fact, so it remains true that $Q'_t$ approximates the optimal value function $Q^*$ uniformly w.p.1.

Let $Q_0 = Q'_0$ be an arbitrary value function, and let

$$\Delta_t(x,a) := \bigl|Q_t(x,a) - Q'_t(x,a)\bigr|.$$

Clearly, for any $(x,a)$,

$$\Delta_{t+1}(x,a) \le (1-\alpha_t)\,\Delta_t(x,a) + \alpha_t \gamma \bigl|(\otimes Q_t)(y_t) - (\otimes Q'_t)(y'_t)\bigr|,$$

where we used the shorthand $\alpha_t$ for $\alpha_t(x,a)$. Since this holds for every $(x,a)$, it also does for their maximum, $\|\Delta_t\| = \max_{x,a}\Delta_t(x,a)$. As mentioned before, $\Pr(y_t \neq y'_t) \le \varepsilon$. If $y_t = y'_t$, then $\bigl|(\otimes Q_t)(y_t) - (\otimes Q'_t)(y'_t)\bigr| \le \|\Delta_t\|$, because $\otimes$ is a non-expansion. Otherwise the only applicable upper bound is $2Q_{\max}$, where $Q_{\max}$ is an upper bound on both $\|Q_t\|$ and $\|Q'_t\|$. Therefore the following random sequence bounds $\|\Delta_t\|$ from above: $\tilde\Delta_0 = \|\Delta_0\|$,

$$\tilde\Delta_{t+1} = (1-\alpha_t)\,\tilde\Delta_t + \alpha_t \gamma \bigl(\tilde\Delta_t + \xi_t\bigr),$$

where

$$\xi_t = \begin{cases} 0 & \text{if } y_t = y'_t, \\ 2Q_{\max} & \text{otherwise,} \end{cases}$$

so that $\mathbb{E}[\xi_t] \le 2Q_{\max}\varepsilon$.

Consequently, $\limsup_{t\to\infty}\|Q_t - Q'_t\| \le \frac{2\gamma Q_{\max}}{1-\gamma}\,\varepsilon$ w.p.1, and since $Q'_t$ converges to $Q^*$, also $\limsup_{t\to\infty}\|Q_t - Q^*\| \le \frac{2\gamma Q_{\max}}{1-\gamma}\,\varepsilon$ w.p.1, which completes the proof.
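To see where a bound of this shape comes from, one can iterate the averaged version of such a bounding recursion numerically. The constants below and the recursion $d_{t+1} = (1-\alpha)\,d_t + \alpha\gamma(d_t + 2Q_{\max}\varepsilon)$ are our own illustrative assumptions (a small constant step size is used purely for speed of convergence); the point is only that the fixed point is $2\gamma Q_{\max}\varepsilon/(1-\gamma)$:

```python
# Numerical illustration (our own assumptions, not the paper's code):
# the averaged bounding recursion
#     d_{t+1} = (1 - alpha) * d_t + alpha * gamma * (d_t + 2 * Qmax * eps)
# contracts toward the fixed point 2 * gamma * Qmax * eps / (1 - gamma).
gamma, Qmax, eps = 0.9, 1.0, 0.1
alpha = 0.01          # small constant step size, for illustration only
d = 5.0               # arbitrary initial value
for _ in range(20000):
    d = (1 - alpha) * d + alpha * gamma * (d + 2 * Qmax * eps)
fixed_point = 2 * gamma * Qmax * eps / (1 - gamma)  # = 1.8 with these constants
```

Each step shrinks the distance to the fixed point by the factor $1 - \alpha(1-\gamma)$, so after 20,000 iterations the iterate is numerically indistinguishable from $1.8$.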