
## Formal Description of Event-learning

Similarly to most other RL algorithms, the E-learning algorithm [Lorincz et al.(2002)] also uses a value function, the event-value function $E : X \times X \to \mathbb{R}$. Pairs of states $(x, y)$ and $(x, y^d)$ are called events and desired events, respectively. For a given initial state $x$, let us denote the desired next state by $y^d$. The state pair $(x, y^d)$ is the desired event, or subtask; $E(x, y^d)$ is the value of trying to get from actual state $x$ to next state $y^d$. The state $y$ reached by the subtask could be different from the desired state $y^d$. One of the advantages of this formulation is that one may--but does not have to--specify the transition time: realizing the subtask may take more than one step for the controller, which is working in the background.

This value may be different from the expected discounted total reward of eventually getting from $x$ to $y^d$. We use the former definition, since we want to use the event-value function for finding an optimal successor state. To this end, the event-selection policy $\pi^E : X \times X \to [0,1]$ is introduced: $\pi^E(x, y^d)$ gives the probability of selecting desired state $y^d$ in state $x$. However, the system usually cannot be controlled by "wishes" (desired new states); decisions have to be expressed in actions. This is done by the action-selection policy (or controller policy) $\pi^A : X \times X \times A \to [0,1]$, where $\pi^A(x, y^d, a)$ gives the probability that the agent selects action $a$ to realize the transition $x \to y^d$.

An important property of event-learning is the following: only the event-selection policy $\pi^E$ is learned (through the event-value function), and the learning problem of the controller's policy $\pi^A$ is separated from E-learning. From the viewpoint of E-learning, the controller policy is part of the environment, just like the transition probabilities.
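To make this separation concrete, here is a minimal sketch of the two-level decision structure in Python. All names, shapes, and the uniform tabular policies are illustrative assumptions, not taken from the paper; the point is only that E-learning adapts `pi_E`, while `pi_A` is a black box queried like the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# pi_E[x, yd]    -- probability of selecting desired next state yd in state x
# pi_A[x, yd, a] -- probability that the controller picks action a for the
#                   subtask "get from x to yd" (treated as given, not learned here)
pi_E = np.full((n_states, n_states), 1.0 / n_states)
pi_A = np.full((n_states, n_states, n_actions), 1.0 / n_actions)

def select_event_and_action(x):
    """Two-level decision: first a desired event, then an action."""
    yd = rng.choice(n_states, p=pi_E[x])      # event-selection policy
    a = rng.choice(n_actions, p=pi_A[x, yd])  # controller (action-selection) policy
    return yd, a

yd, a = select_event_and_action(x=0)
print(f"desired next state: {yd}, action: {a}")
```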

The event-value function corresponding to a given action-selection policy $\pi^A$ can be expressed by the state value function:

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{a \in A} \pi^A(x, y^d, a) \sum_{y \in X} P(x, a, y) \left[ R(x, a, y) + \gamma V^{\pi^E,\pi^A}(y) \right],$$

and conversely:

$$V^{\pi^E,\pi^A}(x) = \sum_{y^d \in X} \pi^E(x, y^d) \, E^{\pi^E,\pi^A}(x, y^d).$$

From the last two equations the recursive formula

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{a \in A} \pi^A(x, y^d, a) \sum_{y \in X} P(x, a, y) \left[ R(x, a, y) + \gamma \sum_{z^d \in X} \pi^E(y, z^d) \, E^{\pi^E,\pi^A}(y, z^d) \right] \tag{8}$$

can be derived by substituting the second equation into the first. Denote by $P^{\pi^A}(x, y^d, y)$ the probability that, given the initial state $x$ and goal state $y^d$, the controller and the environment drive the system to state $y$ in one step. Clearly,

$$P^{\pi^A}(x, y^d, y) = \sum_{a \in A} \pi^A(x, y^d, a) \, P(x, a, y). \tag{9}$$
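In tabular form, Equation 9 is a single tensor contraction over the action index. A minimal sketch, assuming hypothetical arrays `P[x, a, y]` for the environment and `pi_A[x, yd, a]` for the controller:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions over y

pi_A = np.full((n_states, n_states, n_actions), 1.0 / n_actions)

# P_piA[x, yd, y] = sum_a pi_A[x, yd, a] * P[x, a, y]   (Equation 9)
P_piA = np.einsum("xda,xay->xdy", pi_A, P)

assert np.allclose(P_piA.sum(axis=2), 1.0)  # each (x, yd) row is a distribution
```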

Note that in an on-line process (where $x_t$, $y^d_t$ and $r_t$ denote the actual state, the desired (planned) state and the immediate reward at time step $t$), the state $x_{t+1}$ is sampled from the distribution $P^{\pi^A}(x_t, y^d_t, \cdot)$. Using Equation 9, Equation 8 can be rewritten in the following form:

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma \sum_{z^d \in X} \pi^E(y, z^d) \, E^{\pi^E,\pi^A}(y, z^d) \right],$$

where $R^{\pi^A}(x, y^d, y)$ denotes the expected immediate reward of the transition $x \to y$ when actions are drawn according to $\pi^A(x, y^d, \cdot)$.
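Because $x_{t+1}$ is a sample from $P^{\pi^A}(x_t, y^d_t, \cdot)$, this form suggests a sampled, TD-style backup. The sketch below is only an illustration consistent with the equation above, not the paper's exact E-learning update rule; the table `E`, the policy `pi_E`, and the step size `alpha` are assumptions.

```python
import numpy as np

n_states = 4
gamma, alpha = 0.9, 0.1
E = np.zeros((n_states, n_states))                    # estimates E[x, yd]
pi_E = np.full((n_states, n_states), 1.0 / n_states)  # event-selection policy

def sampled_backup(x, yd, r, x_next):
    """Move E[x, yd] toward r + gamma * sum_zd pi_E[x_next, zd] * E[x_next, zd].

    x_next plays the role of a single sample from P^{pi_A}(x, yd, .), so the
    outer sum over y in the recursion is replaced by the observed transition.
    """
    target = r + gamma * float(pi_E[x_next] @ E[x_next])
    E[x, yd] += alpha * (target - E[x, yd])

# illustrative step: the subtask (0 -> 2) ended in state 1 with reward 0.5
sampled_backup(x=0, yd=2, r=0.5, x_next=1)
```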

Definition 4.1   For a fixed controller policy $\pi^A$, an event-value function $E^*_{\pi^A}$ is optimal if it satisfies

$$E^*_{\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma \max_{z^d \in X} E^*_{\pi^A}(y, z^d) \right]$$

for all $x$ and $y^d$.
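Definition 4.1 is a fixed-point equation, so for a fixed controller $E^*_{\pi^A}$ can be approximated by successive substitution, in analogy with value iteration. A minimal sketch under tabular assumptions; `P_piA` and `R_piA` are hypothetical stand-ins for $P^{\pi^A}$ and $R^{\pi^A}$:

```python
import numpy as np

def event_value_iteration(P_piA, R_piA, gamma=0.9, tol=1e-8):
    """Iterate the backup of Definition 4.1 until convergence.

    P_piA[x, yd, y] -- one-step dynamics under the fixed controller (Eq. 9)
    R_piA[x, yd, y] -- expected immediate reward of the transition x -> y
    """
    n = P_piA.shape[0]
    E = np.zeros((n, n))
    while True:
        greedy = E.max(axis=1)                       # max_zd E[y, zd]
        target = R_piA + gamma * greedy[None, None, :]
        E_new = np.einsum("xdy,xdy->xd", P_piA, target)
        if np.abs(E_new - E).max() < tol:
            return E_new
        E = E_new
```

Since the backup is a $\gamma$-contraction in the max norm, the iteration converges to the unique fixed point.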

It can be shown that $E^*_{\pi^A}$ satisfies the following Bellman-type equations:

$$E^*_{\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma V^*_{\pi^A}(y) \right],$$

where

$$V^*_{\pi^A}(x) = \max_{y^d \in X} E^*_{\pi^A}(x, y^d).$$

Remark 4.2   It is easy to see that $V^*_{\pi^A}(x) \le V^*(x)$ for all $x$. A controller policy $\pi^A$ is optimal if it maximizes the l.h.s. of this expression.

An optimal event-value function with respect to an optimal controller policy will be denoted by $E^*$.
