

The ACCL algorithm

We first present the ACCL algorithm for the case of binary variables, deferring the extension to general discrete variables to Section 6.2. For binary variables we say that a variable is ``on'' when it takes the value 1; otherwise we say that it is ``off''. Without loss of generality we assume that each variable is off more often than it is on in the given dataset. The target distribution $P$ is assumed to be derived from a set of observations of size $N$. Let us denote by $N_v$ the number of times variable $v$ is on in the dataset and by $N_{uv}$ the number of times variables $u$ and $v$ are simultaneously on. We call each of the latter events a co-occurrence of $u$ and $v$. The marginal $P_{uv}$ of $u$ and $v$ is then given by:
\begin{eqnarray*}
N \cdot P_{uv}(1,1) & = & N_{uv} \\
N \cdot P_{uv}(1,0) & = & N_u - N_{uv} \\
N \cdot P_{uv}(0,1) & = & N_v - N_{uv} \\
N \cdot P_{uv}(0,0) & = & N - N_u - N_v + N_{uv}
\end{eqnarray*}
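As a concrete illustration (ours, not part of the paper's pseudocode), these relations can be written down directly; the function name and the dictionary layout below are illustrative only:

\begin{verbatim}
def pairwise_marginal(N, N_u, N_v, N_uv):
    """2x2 marginal P_uv(a, b) of binary variables u and v,
    recovered from the counts N, N_u, N_v and N_uv."""
    return {
        (1, 1): N_uv / N,
        (1, 0): (N_u - N_uv) / N,
        (0, 1): (N_v - N_uv) / N,
        (0, 0): (N - N_u - N_v + N_uv) / N,
    }
\end{verbatim}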

All the information about $P$ that is necessary for fitting the tree is summarized in the counts $N$, $N_v$ and $N_{uv}$, $u,v=1,\ldots,n$, and from now on we will consider $P$ to be represented by these counts. (It is an easy extension to handle non-integer data, such as when the data points are ``weighted'' by real numbers.) We now define the notion of sparseness that motivates the ACCL algorithm. Let us denote by $\vert x\vert$ the number of variables that are on in observation $x$, where $0 \leq \vert x\vert \leq n$. Define the data sparseness $s$ as

\begin{displaymath}
s = \max_{i=1,\ldots,N} \vert x^i \vert.
\end{displaymath}
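For instance, assuming each observation is stored simply as the list of the variables that are on in it (a sparse ``list representation''; this choice and the names below are ours), the counts $N_v$, $N_{uv}$ and the sparseness $s$ can be accumulated in a single pass over the data:

\begin{verbatim}
from collections import Counter
from itertools import combinations

def summarize(data):
    """data: list of observations, each given as the list of its 'on' variables.
    Returns (N, N_v, N_uv, s) where N_v[v] counts occurrences of v,
    N_uv[(u, v)] counts co-occurrences with u < v, and s = max_i |x^i|."""
    N = len(data)
    N_v = Counter()
    N_uv = Counter()
    s = 0
    for x in data:
        s = max(s, len(x))                       # sparseness: largest |x^i|
        N_v.update(x)                            # occurrences of each variable
        N_uv.update(combinations(sorted(x), 2))  # co-occurrences within x
    return N, N_v, N_uv, s
\end{verbatim}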

If, for example, the data are documents and the variables represent words from a vocabulary, then $s$ represents the maximum number of distinct words in a document. The time and memory requirements of the algorithm that we describe depend on the sparseness $s$: the lower the sparseness, the more efficient the algorithm. Our algorithm realizes its largest performance gains when $s \ll n, N$. Recall that the CHOWLIU algorithm greedily adds edges to a graph by choosing, at each step, the edge with the maximal value of mutual information. The algorithm that we describe involves an efficient way to rank-order mutual information values. There are two key aspects to the algorithm: (1) how to compare mutual information between non-co-occurring variables, and (2) how to compute co-occurrences in a list representation.
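To make the baseline concrete, the following is a minimal sketch (ours, not the paper's pseudocode) of the standard CHOWLIU greedy step on the count representation above: mutual information is computed for every pair from $N$, $N_u$, $N_v$, $N_{uv}$, the candidate edges are sorted, and edges are accepted as long as they do not create a cycle. This is the quadratic step that ACCL is designed to accelerate; the function names and the assumption that variables are indexed $0,\ldots,n-1$ are illustrative.

\begin{verbatim}
from math import log

def mutual_information(N, N_u, N_v, N_uv):
    """Empirical mutual information I(u; v) from the counts N, N_u, N_v, N_uv."""
    # Cell counts of the 2x2 table, with the matching row and column totals.
    cells = [(N_uv,                 N_u,     N_v),      # (u, v) = (1, 1)
             (N_u - N_uv,           N_u,     N - N_v),  # (1, 0)
             (N_v - N_uv,           N - N_u, N_v),      # (0, 1)
             (N - N_u - N_v + N_uv, N - N_u, N - N_v)]  # (0, 0)
    mi = 0.0
    for c, row, col in cells:
        if c > 0:
            mi += (c / N) * log(c * N / (row * col))
    return mi

def chow_liu_tree(n, N, N_v, N_uv):
    """Greedy (Kruskal-style) maximum spanning tree over mutual information:
    repeatedly take the remaining edge of maximal I(u; v) that adds no cycle."""
    edges = sorted(
        ((mutual_information(N, N_v[u], N_v[v], N_uv.get((u, v), 0)), u, v)
         for u in range(n) for v in range(u + 1, n)),
        reverse=True)
    parent = list(range(n))            # union-find forest over the variables

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                   # the edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v))
    return tree
\end{verbatim}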
