It is an easy consequence of Theorem A.1 that under appropriate conditions, the Q-learning algorithm is convergent.

Consider the substitution , and . Furthermore, let the randomized approximate dynamic programming operator sequence defined by

which approximates at w.p.1 over under the assumptions (2)-(4) by the well-known Robbins-Monro theorem [Robbins and Monro(1951)].

Finally, let

and

Condition (4) of Theorem A.1 trivially holds, while conditions (1) and (2) are implied by the definition of and the non-expansion property of . Finally, condition (3) is a consequence of assumption (3) and the definition of .

Applying Theorem A.1 proves the statement.