Similarly to most other RL algorithms, the E-learning algorithm
[Lorincz et al.(2002)] also uses a value function, the
*event-value function* $E$. Pairs of
states $(x, y)$ and $(x, y^d)$ are called *events* and
*desired events*, respectively. For a given initial state
$x$, let us denote the desired next state by $y^d$. The
state sequence $(x, y^d)$ is the desired event, or
*subtask*, and $E(x, y^d)$ is the value of trying to get from
actual state $x$ to next state $y^d$. The state $y$ reached by the
subtask could be different from the desired state $y^d$. One of
the advantages of this formulation is that one may--but does not
have to--specify the transition time: realizing the subtask may
take more than one step for the controller, which is working in
the background.
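For concreteness, a minimal sketch of how such an event-value table might be stored for a small finite state space is given below; the sizes and the name `E` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch, assuming a small finite state space.
import numpy as np

n_states = 5
# E[x, y_d]: value of the desired event (x, y_d), i.e. of trying to reach
# y_d from x. The state actually reached may differ from y_d, and the
# underlying controller may need several low-level steps to get there.
E = np.zeros((n_states, n_states))
```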

This value may be different from the expected discounted total reward of eventually
getting from $x$ to $y^d$. We use the former definition, since we want to use the
event-value function for finding an optimal successor state. To this end, the
event-selection policy $\pi^E$ is introduced:
$\pi^E(x, y^d)$ gives the probability of selecting desired state $y^d$ in state $x$.
However, the system usually cannot be controlled by "wishes" (desired new states);
decisions have to be
expressed in actions. This is done by the action-selection policy (or controller policy)
$\pi^A$, where $\pi^A(x, y^d, a)$ gives the probability
that the agent selects action $a$ to realize the transition $x \to y^d$.^{4}
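A hedged sketch of this two-level decision scheme follows: the event-selection policy first picks a desired successor state, and the controller policy then picks an action intended to realize it. The variable names and the uniform initializations are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3  # hypothetical sizes

# pi_E[x, y_d]: probability of wishing for successor y_d in state x
pi_E = np.full((n_states, n_states), 1.0 / n_states)
# pi_A[x, y_d, a]: probability that the controller tries action a
# in order to realize the transition x -> y_d
pi_A = np.full((n_states, n_states, n_actions), 1.0 / n_actions)

def select(x):
    """Two-level decision: choose a desired event, then an action for it."""
    y_d = rng.choice(n_states, p=pi_E[x])      # event-selection policy
    a = rng.choice(n_actions, p=pi_A[x, y_d])  # action-selection (controller) policy
    return y_d, a
```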

An important property of event learning is the following: only the event-selection policy is learned (through the event-value function) and the learning problem of the controller's policy is separated from E-learning. From the viewpoint of E-learning, the controller policy is part of the environment, just like the transition probabilities.
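One way to picture this separation is to fold the controller into the environment, so that E-learning only ever issues desired events. The wrapper below is a sketch under that reading; `EventEnv`, `base_step`, and `controller` are hypothetical names, not an API from the paper.

```python
class EventEnv:
    """Environment as seen by E-learning: it accepts a desired event (x, y_d)
    and hides both the controller policy and the transition probabilities."""

    def __init__(self, base_step, controller):
        self.base_step = base_step    # (x, a) -> (next_state, reward)
        self.controller = controller  # (x, y_d) -> action

    def step(self, x, y_d):
        a = self.controller(x, y_d)   # the controller turns the "wish" into an action
        y, r = self.base_step(x, a)   # underlying dynamics; y may differ from y_d
        return y, r
```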

The event-value function corresponding to a given action-selection policy can be expressed by the state-value function:

$$E^{\pi}(x, y^d) = \sum_{a} \pi^A(x, y^d, a) \sum_{y} P(x, a, y) \left[ R(x, y) + \gamma V^{\pi}(y) \right],$$

where $P(x, a, y)$ is the probability of the $x \to y$ transition under action $a$ and $R(x, y)$ is the corresponding immediate reward, and conversely:

$$V^{\pi}(x) = \sum_{y^d} \pi^E(x, y^d)\, E^{\pi}(x, y^d).$$
From the last two equations the recursive formula

$$E^{\pi}(x, y^d) = \sum_{a} \pi^A(x, y^d, a) \sum_{y} P(x, a, y) \left[ R(x, y) + \gamma \sum_{z^d} \pi^E(y, z^d)\, E^{\pi}(y, z^d) \right] \tag{8}$$

can be derived. Denote by $P^{\pi^A}(x, y^d, y)$ the probability that, given the initial state $x$ and goal state $y^d$, the controller and the environment drive the system to state $y$ in one step. Clearly,

$$P^{\pi^A}(x, y^d, y) = \sum_{a} \pi^A(x, y^d, a)\, P(x, a, y). \tag{9}$$
Note that in an on-line process (where $x_t$, $y^d_t$ and $r_t$ denote the actual state, the desired (planned) state and the immediate reward at time step $t$), the state $x_{t+1}$ is sampled from the distribution $P^{\pi^A}(x_t, y^d_t, \cdot)$. Using Equation 9, Equation 8 can be rewritten in the following form:

$$E^{\pi}(x, y^d) = \sum_{y} P^{\pi^A}(x, y^d, y) \left[ R(x, y) + \gamma \sum_{z^d} \pi^E(y, z^d)\, E^{\pi}(y, z^d) \right].$$
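Since $x_{t+1}$ is a sample from $P^{\pi^A}(x_t, y^d_t, \cdot)$, this rewritten recursion suggests a temporal-difference style backup from on-line experience. The excerpt does not restate the actual E-learning update rule, so the step size `alpha` and the exact form below are assumptions, not the paper's algorithm.

```python
import numpy as np

gamma, alpha = 0.95, 0.1
n_states = 5
E = np.zeros((n_states, n_states))                    # event-value estimates
pi_E = np.full((n_states, n_states), 1.0 / n_states)  # current event-selection policy

def td_update(x_t, y_d_t, r_t, x_next):
    """One sampled backup of the rewritten recursion: x_next is produced by the
    controller and the environment, i.e. drawn from P^{pi^A}(x_t, y_d_t, .)."""
    v_next = pi_E[x_next] @ E[x_next]                 # expected value of the next event
    E[x_t, y_d_t] += alpha * (r_t + gamma * v_next - E[x_t, y_d_t])
```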
It can be shown that the optimal event-value function for a fixed controller policy $\pi^A$
satisfies the following
Bellman-type equations:

$$E^*_{\pi^A}(x, y^d) = \sum_{y} P^{\pi^A}(x, y^d, y) \left[ R(x, y) + \gamma V^*_{\pi^A}(y) \right],$$

where

$$V^*_{\pi^A}(y) = \max_{z^d} E^*_{\pi^A}(y, z^d).$$
An optimal event-value function with respect to an optimal controller policy will be denoted by $E^*$.
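For a fixed controller policy, the Bellman-type equations above can be iterated to a fixed point, analogously to ordinary value iteration. The sketch below assumes the one-step probabilities $P^{\pi^A}$ and the rewards $R$ are available as tables; it illustrates the fixed-point computation only and is not the paper's learning algorithm.

```python
import numpy as np

def event_value_iteration(P_pi, R, gamma=0.95, n_iter=1000, tol=1e-8):
    """P_pi[x, y_d, y]: probability of reaching y for the desired event (x, y_d)
    (Equation 9); R[x, y]: immediate reward of the x -> y transition."""
    n = R.shape[0]
    E = np.zeros((n, n))
    for _ in range(n_iter):
        V = E.max(axis=1)                                  # greedy event selection
        target = R[:, None, :] + gamma * V[None, None, :]  # r + gamma * V(y), per (x, y_d, y)
        E_new = (P_pi * target).sum(axis=2)                # expectation over reached states y
        if np.max(np.abs(E_new - E)) < tol:
            return E_new
        E = E_new
    return E
```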