
Q-learning in Generalized MDPs

As shown by Szepesvári and Littman, the analogue of the Q-learning algorithm (Section 2.1) can also be defined in generalized MDPs [Szepesvári and Littman(1996)]. Furthermore, convergence results can be established for this general algorithm.

In this subsection we review the generalized algorithm. We restrict ourselves to models where the operator $ {\textstyle\bigoplus}$ is the expected-value operator, i.e., $ ({\textstyle\bigoplus}g)(x,a) = \sum_y P(x,a,y)\, g(x,a,y)$. This simplifies the definition considerably, and this special case is sufficient for the purposes of this article. The general definition can be found in the work of Szepesvári and Littman [Szepesvári and Littman(1996)].
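As a concrete illustration, the following minimal sketch computes the expected-value operator for a finite model stored as NumPy arrays; the array layout P[x, a, y] and the function name are illustrative assumptions, not taken from the cited work.

    import numpy as np

    def expected_value_op(P, g):
        # (bigoplus g)(x, a) = sum_y P(x, a, y) * g(x, a, y)
        # P and g both have shape (num_states, num_actions, num_states).
        return np.einsum('xay,xay->xa', P, g)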

In Q-learning, the optimal state-action value function $ Q^*$ satisfies $ Q^* = {\textstyle\bigoplus}(R + \gamma V^*)$. Since $ V^* = {\textstyle\bigotimes}Q^*$, it follows that $ Q^*$ is the fixed point of the operator $ K$ defined by

$\displaystyle KQ = {\textstyle\bigoplus}(R + \gamma {\textstyle\bigotimes}Q).$ (2)
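To illustrate equation (2), the following sketch iterates $ K$ to its fixed point for the special case in which $ {\textstyle\bigotimes}$ is maximization over actions (the classical MDP choice); the function names and the assumption $ \gamma < 1$ are illustrative, and for other generalized MDPs only the line computing $ {\textstyle\bigotimes}Q$ would change.

    import numpy as np

    def apply_K(Q, P, R, gamma):
        # (KQ)(x, a) = sum_y P(x, a, y) * (R(x, a, y) + gamma * (bigotimes Q)(y)),
        # with (bigotimes Q)(y) = max_b Q(y, b) in the classical MDP case.
        # P and R have shape (S, A, S); Q has shape (S, A).
        V = Q.max(axis=1)
        return np.einsum('xay,xay->xa', P, R + gamma * V[None, None, :])

    def q_fixed_point(P, R, gamma, tol=1e-8):
        # Iterate K starting from Q = 0; for gamma < 1 the operator K is a
        # contraction, so the iterates converge to the fixed point Q*.
        Q = np.zeros(P.shape[:2])
        while True:
            Q_next = apply_K(Q, P, R, gamma)
            if np.max(np.abs(Q_next - Q)) < tol:
                return Q_next
            Q = Q_next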

The generalized Q-learning algorithm starts with an arbitrary initial value function $ Q_0$, and then uses the following update rule:

$\displaystyle Q_{t+1}(x_t,a_t) = (1-\alpha_t(x_t,a_t))Q_t(x_t,a_t) + \alpha_t(x_t,a_t)(r_t+\gamma ({\textstyle\bigotimes}Q_t)(y_t)),$ (3)

where $ x_t$ is the current state, $ a_t$ is the selected action, $ y_t$ is the resulting next state, $ r_t$ is the reward received, and $ \alpha_t(x,a)$ is the learning rate at time $ t$. The next state $ y_t$ is drawn according to the probability distribution $ P(x_t, a_t, \cdot)$, and $ Q_t$ is the current estimate of $ Q^*$.
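A minimal sketch of one application of update rule (3), again taking $ {\textstyle\bigotimes}$ to be maximization over actions and using illustrative names (the function signature is an assumption, not part of the cited algorithm):

    import numpy as np

    def q_learning_step(Q, x, a, r, y, alpha, gamma):
        # Update rule (3) for the sampled transition (x_t, a_t, r_t, y_t),
        # with (bigotimes Q_t)(y_t) = max_b Q_t(y_t, b) in the classical MDP case.
        target = r + gamma * np.max(Q[y])
        Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
        return Q

For other instances of the generalized framework, only the computation of $ ({\textstyle\bigotimes}Q_t)(y_t)$ would change.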

It has been proven that under appropriate conditions on the order of updates and on the learning rates $ \alpha_t(x,a)$, the above Q-learning algorithm converges to the optimal Q-value function. The proof is based on a theorem establishing the convergence of a particular asynchronous stochastic-approximation process. Both theorems are cited in Appendix A.
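Although the precise statement is the one cited in Appendix A, the conditions imposed on the learning rates are typically of the standard Robbins-Monro type: every state-action pair $ (x,a)$ must be updated infinitely often, and

$\displaystyle \sum_{t} \alpha_t(x,a) = \infty, \qquad \sum_{t} \alpha_t^2(x,a) < \infty$

with probability 1.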

