Event-Learning with the SDS Controller

In this subsection we review a particular controller for continuous dynamical systems, the static and dynamic state (SDS) feedback controller proposed by Lorincz and colleagues [Szepesvári et al.(1997),Szepesvári and Lorincz(1997)]; for more details see the Appendix. We then show that it can be easily inserted into the E-learning scheme.

The SDS control scheme gives a solution to the control problem called speed field tracking⁵ (SFT) in continuous dynamical systems [Hwang and Ahuja(1992),Fomin et al.(1997),Szepesvári and Lorincz(1998)]. The problem is the following. Assume that a state space $X$ and a velocity field $v^d: X \to \dot{X}$ are given. At time $t$, the system is in state $x_t$ with velocity $v_t$. We are looking for a control action that modifies the actual velocity to $v^d(x_t)$. The obvious solution is to apply an inverse dynamics, i.e., to apply the control signal in state $x_t$ which drives the system into $v^d(x_t)$ with maximum probability:

$\displaystyle u_t( x_t,v^d_t) = \Phi(x_t, v^d_t)$

Of course, the inverse dynamics $\Phi(x_t, v^d_t)$ has to be determined some way, for example by exploring the state space first.
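
As a minimal sketch of determining the inverse dynamics by exploration, the following Python snippet works in a discretized setting where a desired successor state $y$ stands in for the desired velocity. The chain environment, the function names `noisy_chain_step` and `estimate_inverse_dynamics`, and the sample count are illustrative assumptions and do not come from the cited works. Each action is tried equally often in every state, so the action recorded most frequently for a pair $(x, y)$ estimates the action that reaches $y$ from $x$ with maximum probability.

```python
import random
from collections import Counter, defaultdict

def noisy_chain_step(x, u, rng, n_states=5, slip=0.1):
    """Hypothetical environment: move by u in {-1, +1}, but stay put with probability slip."""
    if rng.random() < slip:
        return x
    return min(max(x + u, 0), n_states - 1)

def estimate_inverse_dynamics(step, states, actions, samples_per_pair=200, seed=0):
    """Tabular estimate of Phi(x, y): the action seen most often to lead from x to y."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)          # counts[(x, y)][u] = number of x -> y moves under u
    for x in states:
        for u in actions:                  # uniform exploration: same budget for every action
            for _ in range(samples_per_pair):
                counts[(x, step(x, u, rng))][u] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

# Example: which action most likely takes state 2 to state 3?
phi = estimate_inverse_dynamics(noisy_chain_step, range(5), (-1, +1))
```

Pairs $(x, y)$ that were never observed are simply absent from the table, which is one reason an exact inverse dynamics may be hard to obtain from exploration alone.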

The SDS controller provides an approximate solution such that the tracking error, i.e., $\Vert v^d(x_t) - v_t\Vert$, is bounded, and this bound can be made arbitrarily small. This is a considerable advantage over direct approximations of the inverse dynamics, whose error can be unbounded and may therefore lead to instabilities when used in E-learning.

Studies on SDS showed that it is robust, i.e., capable of solving the SFT problem with a bounded, prescribed tracking error [Fomin et al.(1997),Szepesvári et al.(1997),Szepesvári and Lorincz(1997),Szepesvári(1998)]. Moreover, it has been shown to be robust against perturbations of the dynamics of the system and against discretization of the state space [Lorincz et al.(2002)]. The SDS controller fits real physical problems well, where the variance of the velocity field is moderate.

The SDS controller applies an approximate inverse dynamics $\hat{\Phi}$, which is then corrected by a feedback term (for the sake of convenience, we use the shorthand $v^d_t = v^d(x_t)$). The output of the SDS controller is

$\displaystyle u_t( x_t,v^d_t) = \hat{\Phi}(x_t, v^d_t) + \Lambda \int_{\tau=0}^{t} w_\tau d\tau,$

where

$\displaystyle w_\tau = \hat{\Phi}(x_\tau, v^d_\tau) - \hat{\Phi}(x_\tau,v_\tau)$

is the correction term, and $\Lambda > 0$ is the gain of the feedback. It was shown that under appropriate conditions, the eventual tracking error of the controller is bounded by $O(1/\Lambda)$. The assumptions on the approximate inverse dynamics are quite mild: only sign-properness is required [Szepesvári et al.(1997),Szepesvári and Lorincz(1997)].⁶ Generally, such an approximate inverse dynamics is easy to construct either by explicit formulae or by observing the dynamics of the system during learning.
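
To illustrate the $O(1/\Lambda)$ behavior, the sketch below simulates the controller on a hypothetical scalar system whose velocity responds as $v = a u + d$, using a deliberately crude model $\hat{\Phi}(x, v) = v/\hat{a}$ that only gets the sign of the slope right. The plant, the velocity field $v^d(x) = 1.5 + \sin x$, the Euler discretization of the integral, and all numeric values are assumptions chosen for illustration; they are not taken from the cited papers.

```python
import math

def simulate_sds(gain, dt=1e-3, t_end=10.0, t_settle=2.0):
    """Return the largest tracking error |v^d(x_t) - v_t| seen after t_settle."""
    a_true, drift = 2.0, 0.5   # hypothetical plant: v = a_true * u + drift
    a_model = 1.0              # crude model; only the sign of the slope is right

    def v_desired(x):          # hypothetical velocity field v^d(x)
        return 1.5 + math.sin(x)

    def phi_hat(x, v):         # approximate inverse dynamics \hat{\Phi}(x, v)
        return v / a_model

    x, integral = 0.0, 0.0
    u = phi_hat(x, v_desired(x))
    worst = 0.0
    for k in range(round(t_end / dt)):
        v = a_true * u + drift                 # velocity produced by the last control
        vd = v_desired(x)
        w = phi_hat(x, vd) - phi_hat(x, v)     # correction term w_tau
        integral += w * dt                     # running integral of w
        u = phi_hat(x, vd) + gain * integral   # SDS control signal
        if k * dt >= t_settle:
            worst = max(worst, abs(vd - v))
        x += v * dt                            # Euler step of the state
    return worst

# Compare the residual tracking error for two feedback gains.
errors = {gain: simulate_sds(gain) for gain in (10.0, 100.0)}
```

Under these assumptions, the residual error for the larger gain should be markedly smaller than for the smaller one, which is consistent with the $O(1/\Lambda)$ bound. Note that in this discretized toy setting the gain cannot be raised indefinitely for a fixed step size without destabilizing the loop.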

The controller described above cannot be applied directly to E-learning, because it uses continuous time and state descriptions. Therefore we have to discretize the state space, and this discretization should satisfy the condition of `sign-properness'. Furthermore, we assume that the dynamics of the system is such that for sufficiently small time steps all conditions of the SDS controller are satisfied.⁷ Note that if time is discrete, then prescribing a desired velocity is equivalent to prescribing a desired successor state [Lorincz et al.(2002)]. Therefore the controller takes the form

$\displaystyle u_t( x_t,y^d_t) = \hat{\Phi}(x_t, y^d_t) + \Lambda \sum_{\tau=0}^{t} w_\tau \cdot \Delta t ,$

where

$\displaystyle w_\tau = \hat{\Phi}(x_\tau, y^d_\tau) - \hat{\Phi}(x_\tau,y_\tau),$

and $\Delta t$ denotes the size of the time steps. Note that $x_\tau$ and $y_\tau$ (and therefore $w_\tau$) change at discretization boundaries only, i.e., when an event is observed. Therefore, event-learning with the SDS controller has more relaxed conditions on the update rate than other reinforcement learning methods [Lorincz et al.(2002)].
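
A minimal sketch of the discrete-time form follows. The class name `DiscreteSDSController`, its `step` method, and the toy scalar model `phi_hat(x, y) = y - x` are hypothetical. The sketch also assumes that `y_observed` is the most recent successor state actually reached, which the correction term compares against the desired one.

```python
class DiscreteSDSController:
    """Sketch of u_t = phi_hat(x_t, y^d_t) + gain * sum_tau w_tau * dt."""

    def __init__(self, phi_hat, gain, dt):
        self.phi_hat = phi_hat    # approximate inverse dynamics (assumed sign-proper)
        self.gain = gain          # feedback gain Lambda > 0
        self.dt = dt              # size of the time step
        self.acc = 0.0            # running value of sum_tau w_tau * dt

    def step(self, x, y_desired, y_observed):
        # Correction term w_t: desired versus actually observed successor state.
        w = self.phi_hat(x, y_desired) - self.phi_hat(x, y_observed)
        self.acc += w * self.dt
        return self.phi_hat(x, y_desired) + self.gain * self.acc

ctrl = DiscreteSDSController(phi_hat=lambda x, y: y - x, gain=2.0, dt=0.5)
u_first = ctrl.step(x=0.0, y_desired=1.0, y_observed=0.0)   # desired state missed: feedback grows
u_second = ctrl.step(x=0.0, y_desired=1.0, y_observed=1.0)  # desired state reached: w = 0
```

Because $w_\tau$ changes only when an event is observed, an implementation along these lines could cache `w` between events and merely add `w * dt` on each tick, rather than re-evaluating the model at every time step.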

The above defined controller can be directly inserted into event-learning by setting

$\displaystyle \pi^A_t(x_t, y^d_t, a) = \begin{cases}1 & \text{if $a=u_t( x_t,y^d_t)$}, \\ 0 & \text{otherwise}. \end{cases} \qquad (10)$

Note that the action space is still infinite.

Corollary 4.3 Assume that the environment is such that $\sum_y \vert P(x,u_1,y) - P(x,u_2,y)\vert \le K \Vert u_1 - u_2\Vert$ for all $x$, $u_1$, $u_2$.⁸ Let $\varepsilon$ be a prescribed number. For sufficiently large $\Lambda$ and sufficiently small time steps, the SDS controller described in Equation 10 and the environment form an $\varepsilon$-MDP.

The proof can be found in Appendix D.

Consequently, Theorem 3.5 is applicable.
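
The Lipschitz-type condition assumed in Corollary 4.3 can be probed numerically for a given environment model. The sketch below is only an illustration: the two-outcome kernel `transition_probs` and the helper `estimate_lipschitz_constant` are hypothetical, and a scan over sampled actions can only give a lower estimate of the smallest admissible $K$, not a proof that the condition holds.

```python
import itertools

def transition_probs(x, u):
    """Hypothetical two-outcome kernel: P(x, u, y=1) = clip(u, 0, 1), independent of x."""
    p = min(max(u, 0.0), 1.0)
    return [1.0 - p, p]

def estimate_lipschitz_constant(kernel, states, actions):
    """Largest ratio sum_y |P(x,u1,y) - P(x,u2,y)| / |u1 - u2| over the sampled points."""
    worst = 0.0
    for x in states:
        for u1, u2 in itertools.combinations(actions, 2):   # distinct action pairs only
            l1 = sum(abs(p - q) for p, q in zip(kernel(x, u1), kernel(x, u2)))
            worst = max(worst, l1 / abs(u1 - u2))
    return worst

actions = [i / 10 for i in range(11)]
K_hat = estimate_lipschitz_constant(transition_probs, states=[0], actions=actions)
```

For this toy kernel the two probabilities move in opposite directions at the same rate, so the estimate comes out close to 2.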
