## Formalizing Event-Learning in the Generalized $\epsilon$-MDP Framework

In most applications we cannot assume that a time-independent
optimal controller policy exists. On the contrary, we may have to
allow the controller policy to adapt over time. In this case, we
may instead require asymptotic near-optimality. This is a more
realistic requirement: in many cases it can be fulfilled, e.g., by
learning an approximate inverse dynamics
[Fomin et al.(1997)] in parallel with E-learning. Alternatively,
the controller policy itself may be subject to
reinforcement learning (with a finer state space resolution), thus
defining a modular hierarchy. Another attractive solution is the
application of a robust controller such as the SDS controller
[Szepesvári and Lorincz(1997)], which is proven to be
asymptotically near-optimal. Furthermore, the SDS controller has a
short adaptation time and is robust against perturbations of the
environment.

As a consequence of the varying environment (recall that from the
viewpoint of E-learning, the controller policy is part of the
environment), we can no longer prove convergence. However, we may
apply Theorem 3.3 to show that the iteration still finds a
near-optimal event-value function. To this end, we have to
reformulate E-learning in the generalized $\epsilon$-MDP framework.

One can define a generalized $\epsilon$-MDP such that its generalized
Q-learning algorithm is identical to our E-learning algorithm: let
the state set and the transition probabilities of the E-learning
algorithm be defined by $X$ and $P$, respectively. In the new
generalized $\epsilon$-MDP the set of states will also be $X$, and since
an action of the learning agent is selecting a new desired state,
the set of actions is also equal to $X$. (Note that because of
this assignment, the generalized Q-value function of this model
will be exactly our event-value function $E$.) The definition of
the reward function is evident: $R(x, y^d, y)$ gives the reward
for arriving at $y$ from $x$, when the desired state was $y^d$.
Let $R_t = R$, independently of $t$,
and let $P_t(x, y^d, y) = P(x, \pi_t(x, y^d), y)$, where
$\pi_t$ is the controller policy at time $t$ (see Equation 9).
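
To make this construction concrete, the following Python sketch assembles the
induced transition probabilities $P_t(x, y^d, y) = P(x, \pi_t(x, y^d), y)$ from a
tabular base MDP and a (possibly time-varying) controller policy. This is only an
illustration of the assignment above; the names `induced_transition`, `base_P`,
and `controller` are hypothetical and not taken from the paper.

```python
import numpy as np

def induced_transition(base_P, controller, t):
    """Induced epsilon-MDP dynamics: P_t(x, y_d, y) = P(x, pi_t(x, y_d), y).

    base_P:     array (n_states, n_actions, n_states), base MDP transitions
    controller: callable (t, x, y_d) -> action, the controller policy pi_t
    Returns an array (n_states, n_states, n_states) indexed by (x, y_d, y).
    """
    n_states = base_P.shape[0]
    P_t = np.empty((n_states, n_states, n_states))
    for x in range(n_states):
        for y_d in range(n_states):      # the learner's "action" is a desired state
            a = controller(t, x, y_d)    # controller translates the event into an action
            P_t[x, y_d] = base_P[x, a]   # next-state distribution under that action
    return P_t
```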

Finally, we assign the operators $\otimes$ and $\oplus_t$ as
$(\otimes E)(x) = \max_{y^d} E(x, y^d)$ and
$(\oplus_t V)(x, y^d) = \sum_{y} P_t(x, y^d, y)\, V(y)$.
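
As a rough illustration, the two operators could be written as follows for the
tabular representation used in the sketch above; the names `otimes` and `oplus_t`
are again hypothetical.

```python
import numpy as np

def otimes(E):
    """(otimes E)(x) = max over desired states y_d of E(x, y_d).

    E: event-value table (n_states, n_states) indexed by (x, y_d);
    returns a state-value vector of length n_states.
    """
    return E.max(axis=1)

def oplus_t(P_t, V):
    """(oplus_t V)(x, y_d) = sum over y of P_t(x, y_d, y) * V(y).

    P_t: induced transition tensor (n_states, n_states, n_states);
    V:   state-value vector of length n_states;
    returns a table (n_states, n_states) indexed by (x, y_d).
    """
    return P_t @ V  # matmul sums over the last axis, i.e. over next states y
```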

The generalized Q-learning algorithm of this model uses the
iteration

$$
E_{t+1}(x_t, y^d_{t+1}) = \bigl(1-\alpha_t(x_t, y^d_{t+1})\bigr)\, E_t(x_t, y^d_{t+1})
+ \alpha_t(x_t, y^d_{t+1}) \Bigl( R(x_t, y^d_{t+1}, x_{t+1}) + \gamma \max_{y'} E_t(x_{t+1}, y') \Bigr),
$$

with all other entries of $E_t$ left unchanged. This is identical to the iteration
defined by [Lorincz et al.(2002)]. Here $\{x_t\}$ is the sequence containing the
realized states at time instants $t$, and $\{y^d_{t+1}\}$ is the sequence containing
the desired states for time instants $t+1$.
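
A minimal sketch of one such update step, assuming a tabular event-value function
stored as a NumPy array and the notation above (the helper name `e_learning_update`
and the fixed learning rate are illustrative choices, not from the paper):

```python
import numpy as np

def e_learning_update(E, x_t, y_d_next, x_next, reward, alpha=0.1, gamma=0.95):
    """Apply the iteration to the single experienced entry of E, in place.

    E:        event-value table (n_states, n_states), indexed by (state, desired state)
    x_t:      realized state at time t
    y_d_next: desired state selected for time t+1 (the learner's "action")
    x_next:   realized state at time t+1
    reward:   R(x_t, y_d_next, x_next)
    """
    # Bootstrapped target: immediate reward plus the discounted best event value
    # achievable from the realized next state (the "otimes" of E at x_next).
    target = reward + gamma * E[x_next].max()
    E[x_t, y_d_next] = (1.0 - alpha) * E[x_t, y_d_next] + alpha * target
    return E
```

For example, after observing the transition $(x_t, y^d_{t+1}, x_{t+1})$ and its
reward, calling `e_learning_update(E, x_t, y_d_next, x_next, reward)` updates only
that one table entry, leaving the rest of $E$ unchanged.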
