
## Formal Description of Event-learning

Similarly to most other RL algorithms, the E-learning algorithm [Lorincz et al.(2002)] also uses a value function, the event-value function $E : X \times X \to \mathbb{R}$. Pairs of states $(x, y)$ and $(x, y^d)$ are called events and desired events, respectively. For a given initial state $x$, let us denote the desired next state by $y^d$. The state pair $(x, y^d)$ is the desired event, or subtask; $E(x, y^d)$ is the value of trying to get from actual state $x$ to next state $y^d$. The state $y$ reached by the subtask could be different from the desired state $y^d$. One of the advantages of this formulation is that one may--but does not have to--specify the transition time: realizing the subtask may take more than one step for the controller, which is working in the background.

This value may be different from the expected discounted total reward of eventually getting from $x$ to $y^d$. We use the former definition, since we want to use the event-value function for finding an optimal successor state. To this end, the event-selection policy $\pi^E : X \times X \to [0,1]$ is introduced: $\pi^E(x, y^d)$ gives the probability of selecting desired state $y^d$ in state $x$. However, the system usually cannot be controlled by "wishes" (desired new states); decisions have to be expressed in actions. This is done by the action-selection policy (or controller policy) $\pi^A : X \times X \times A \to [0,1]$, where $\pi^A(x, y^d, a)$ gives the probability that the agent selects action $a$ to realize the transition $x \to y^d$.

An important property of event-learning is the following: only the event-selection policy $\pi^E$ is learned (through the event-value function), and the learning problem of the controller's policy $\pi^A$ is separated from E-learning. From the viewpoint of E-learning, the controller policy is part of the environment, just like the transition probabilities.
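To make this separation concrete, here is a minimal sketch of the two-level decision structure in Python. All names, shapes, and the uniform tabular policies are illustrative assumptions, not taken from the paper; the point is only that E-learning adapts `pi_E`, while `pi_A` is a black box queried like the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# pi_E[x, yd]    -- probability of selecting desired next state yd in state x
# pi_A[x, yd, a] -- probability that the controller picks action a for the
#                   subtask "get from x to yd" (treated as given, not learned here)
pi_E = np.full((n_states, n_states), 1.0 / n_states)
pi_A = np.full((n_states, n_states, n_actions), 1.0 / n_actions)

def select_event_and_action(x):
    """Two-level decision: first a desired event, then an action."""
    yd = rng.choice(n_states, p=pi_E[x])      # event-selection policy
    a = rng.choice(n_actions, p=pi_A[x, yd])  # controller (action-selection) policy
    return yd, a

yd, a = select_event_and_action(x=0)
print(f"desired next state: {yd}, action: {a}")
```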

The event-value function corresponding to a given action-selection policy $\pi^A$ can be expressed by the state value function:

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{a \in A} \pi^A(x, y^d, a) \sum_{y \in X} P(x, a, y) \left[ R(x, a, y) + \gamma V^{\pi^E,\pi^A}(y) \right],$$

and conversely:

$$V^{\pi^E,\pi^A}(x) = \sum_{y^d \in X} \pi^E(x, y^d) \, E^{\pi^E,\pi^A}(x, y^d).$$

From the last two equations the recursive formula

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{a \in A} \pi^A(x, y^d, a) \sum_{y \in X} P(x, a, y) \left[ R(x, a, y) + \gamma \sum_{z^d \in X} \pi^E(y, z^d) \, E^{\pi^E,\pi^A}(y, z^d) \right] \tag{8}$$

can be derived by substituting the second equation into the first. Denote by $P^{\pi^A}(x, y^d, y)$ the probability that, given the initial state $x$ and goal state $y^d$, the controller and the environment drive the system to state $y$ in one step. Clearly,

$$P^{\pi^A}(x, y^d, y) = \sum_{a \in A} \pi^A(x, y^d, a) \, P(x, a, y). \tag{9}$$
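In tabular form, Equation 9 is a single tensor contraction over the action index. A minimal sketch, assuming hypothetical arrays `P[x, a, y]` for the environment and `pi_A[x, yd, a]` for the controller:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions over y

pi_A = np.full((n_states, n_states, n_actions), 1.0 / n_actions)

# P_piA[x, yd, y] = sum_a pi_A[x, yd, a] * P[x, a, y]   (Equation 9)
P_piA = np.einsum("xda,xay->xdy", pi_A, P)

assert np.allclose(P_piA.sum(axis=2), 1.0)  # each (x, yd) row is a distribution
```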

Note that in an on-line process (where $x_t$, $y^d_t$ and $r_t$ denote the actual state, the desired (planned) state and the immediate reward at time step $t$), the state $x_{t+1}$ is sampled from the distribution $P^{\pi^A}(x_t, y^d_t, \cdot)$. Using Equation 9, Equation 8 can be rewritten in the following form:

$$E^{\pi^E,\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma \sum_{z^d \in X} \pi^E(y, z^d) \, E^{\pi^E,\pi^A}(y, z^d) \right],$$

where $R^{\pi^A}(x, y^d, y)$ denotes the expected immediate reward of the transition $x \to y$ when actions are drawn according to $\pi^A(x, y^d, \cdot)$.
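Because $x_{t+1}$ is a sample from $P^{\pi^A}(x_t, y^d_t, \cdot)$, this form suggests a sampled, TD-style backup. The sketch below is only an illustration consistent with the equation above, not the paper's exact E-learning update rule; the table `E`, the policy `pi_E`, and the step size `alpha` are assumptions.

```python
import numpy as np

n_states = 4
gamma, alpha = 0.9, 0.1
E = np.zeros((n_states, n_states))                    # estimates E[x, yd]
pi_E = np.full((n_states, n_states), 1.0 / n_states)  # event-selection policy

def sampled_backup(x, yd, r, x_next):
    """Move E[x, yd] toward r + gamma * sum_zd pi_E[x_next, zd] * E[x_next, zd].

    x_next plays the role of a single sample from P^{pi_A}(x, yd, .), so the
    outer sum over y in the recursion is replaced by the observed transition.
    """
    target = r + gamma * float(pi_E[x_next] @ E[x_next])
    E[x, yd] += alpha * (target - E[x, yd])

# illustrative step: the subtask (0 -> 2) ended in state 1 with reward 0.5
sampled_backup(x=0, yd=2, r=0.5, x_next=1)
```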

Definition 4.1   For a fixed controller policy $\pi^A$, an event-value function $E^*_{\pi^A}$ is optimal if it satisfies

$$E^*_{\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma \max_{z^d \in X} E^*_{\pi^A}(y, z^d) \right]$$

for all $x$ and $y^d$.
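Definition 4.1 is a fixed-point equation, so for a fixed controller $E^*_{\pi^A}$ can be approximated by successive substitution, in analogy with value iteration. A minimal sketch under tabular assumptions; `P_piA` and `R_piA` are hypothetical stand-ins for $P^{\pi^A}$ and $R^{\pi^A}$:

```python
import numpy as np

def event_value_iteration(P_piA, R_piA, gamma=0.9, tol=1e-8):
    """Iterate the backup of Definition 4.1 until convergence.

    P_piA[x, yd, y] -- one-step dynamics under the fixed controller (Eq. 9)
    R_piA[x, yd, y] -- expected immediate reward of the transition x -> y
    """
    n = P_piA.shape[0]
    E = np.zeros((n, n))
    while True:
        greedy = E.max(axis=1)                       # max_zd E[y, zd]
        target = R_piA + gamma * greedy[None, None, :]
        E_new = np.einsum("xdy,xdy->xd", P_piA, target)
        if np.abs(E_new - E).max() < tol:
            return E_new
        E = E_new
```

Since the backup is a $\gamma$-contraction in the max norm, the iteration converges to the unique fixed point.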

It can be shown that $E^*_{\pi^A}$ satisfies the following Bellman-type equations:

$$E^*_{\pi^A}(x, y^d) = \sum_{y \in X} P^{\pi^A}(x, y^d, y) \left[ R^{\pi^A}(x, y^d, y) + \gamma V^*_{\pi^A}(y) \right],$$

where

$$V^*_{\pi^A}(x) = \max_{y^d \in X} E^*_{\pi^A}(x, y^d).$$

Remark 4.2   It is easy to see that $V^*_{\pi^A}(x) \le V^*(x)$ for all $x$. A controller policy $\pi^A$ is optimal if it maximizes the l.h.s. of this expression.

An optimal event-value function with respect to an optimal controller policy will be denoted by $E^*$.
