

The ACCL algorithm

We first present the ACCL algorithm for the case of binary variables, deferring the extension to general discrete variables to Section 6.2. For binary variables we say that a variable is ``on'' when it takes the value 1; otherwise we say that it is ``off''. Without loss of generality we assume that each variable is off more often than it is on in the given dataset. The target distribution $P$ is assumed to be derived from a set of observations of size $N$. Let us denote by $N_v$ the number of times variable $v$ is on in the dataset and by $N_{uv}$ the number of times variables $u$ and $v$ are simultaneously on. We call each of the latter events a co-occurrence of $u$ and $v$. The marginal $P_{uv}$ of $u$ and $v$ is then given by:
\begin{eqnarray*}
N \cdot P_{uv}(1,1) & = & N_{uv} \\
N \cdot P_{uv}(1,0) & = & N_u - N_{uv} \\
N \cdot P_{uv}(0,1) & = & N_v - N_{uv} \\
N \cdot P_{uv}(0,0) & = & N - N_u - N_v + N_{uv}
\end{eqnarray*}
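As a concrete illustration (ours, not part of the paper's pseudocode), these relations can be written down directly; the function name and the dictionary layout below are illustrative only:

\begin{verbatim}
def pairwise_marginal(N, N_u, N_v, N_uv):
    """2x2 marginal P_uv(a, b) of binary variables u and v,
    recovered from the counts N, N_u, N_v and N_uv."""
    return {
        (1, 1): N_uv / N,
        (1, 0): (N_u - N_uv) / N,
        (0, 1): (N_v - N_uv) / N,
        (0, 0): (N - N_u - N_v + N_uv) / N,
    }
\end{verbatim}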

All the information about $P$ that is necessary for fitting the tree is summarized in the counts $N$, $N_v$ and $N_{uv}$, $u,v=1,\ldots,n$, and from now on we will consider $P$ to be represented by these counts. (It is an easy extension to handle non-integer data, such as when the data points are ``weighted'' by real numbers.) We now define the notion of sparseness that motivates the ACCL algorithm. Let us denote by $\vert x\vert$ the number of variables that are on in observation $x$, where $0 \leq \vert x\vert \leq n$. Define the data sparseness $s$ as

\begin{displaymath}
s = \max_{i=1,\ldots,N} \vert x^i \vert.
\end{displaymath}
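For instance, assuming each observation is stored simply as the list of the variables that are on in it (a sparse ``list representation''; this choice and the names below are ours), the counts $N_v$, $N_{uv}$ and the sparseness $s$ can be accumulated in a single pass over the data:

\begin{verbatim}
from collections import Counter
from itertools import combinations

def summarize(data):
    """data: list of observations, each given as the list of its 'on' variables.
    Returns (N, N_v, N_uv, s) where N_v[v] counts occurrences of v,
    N_uv[(u, v)] counts co-occurrences with u < v, and s = max_i |x^i|."""
    N = len(data)
    N_v = Counter()
    N_uv = Counter()
    s = 0
    for x in data:
        s = max(s, len(x))                       # sparseness: largest |x^i|
        N_v.update(x)                            # occurrences of each variable
        N_uv.update(combinations(sorted(x), 2))  # co-occurrences within x
    return N, N_v, N_uv, s
\end{verbatim}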

If, for example, the data are documents and the variables represent words from a vocabulary, then $s$ represents the maximum number of distinct words in a document. The time and memory requirements of the algorithm that we describe depend on the sparseness $s$: the lower the sparseness, the more efficient the algorithm. Our algorithm realizes its largest performance gains when $s \ll n, N$. Recall that the CHOWLIU algorithm greedily adds edges to a graph by choosing, at each step, the edge with the maximal value of mutual information. The algorithm that we describe involves an efficient way to rank-order mutual information values. There are two key aspects to the algorithm: (1) how to compare mutual information between non-co-occurring variables, and (2) how to compute co-occurrences in a list representation.
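To make the baseline concrete, the following is a minimal sketch (ours, not the paper's pseudocode) of the standard CHOWLIU greedy step on the count representation above: mutual information is computed for every pair from $N$, $N_u$, $N_v$, $N_{uv}$, the candidate edges are sorted, and edges are accepted as long as they do not create a cycle. This is the quadratic step that ACCL is designed to accelerate; the function names and the assumption that variables are indexed $0,\ldots,n-1$ are illustrative.

\begin{verbatim}
from math import log

def mutual_information(N, N_u, N_v, N_uv):
    """Empirical mutual information I(u; v) from the counts N, N_u, N_v, N_uv."""
    # Cell counts of the 2x2 table, with the matching row and column totals.
    cells = [(N_uv,                 N_u,     N_v),      # (u, v) = (1, 1)
             (N_u - N_uv,           N_u,     N - N_v),  # (1, 0)
             (N_v - N_uv,           N - N_u, N_v),      # (0, 1)
             (N - N_u - N_v + N_uv, N - N_u, N - N_v)]  # (0, 0)
    mi = 0.0
    for c, row, col in cells:
        if c > 0:
            mi += (c / N) * log(c * N / (row * col))
    return mi

def chow_liu_tree(n, N, N_v, N_uv):
    """Greedy (Kruskal-style) maximum spanning tree over mutual information:
    repeatedly take the remaining edge of maximal I(u; v) that adds no cycle."""
    edges = sorted(
        ((mutual_information(N, N_v[u], N_v[v], N_uv.get((u, v), 0)), u, v)
         for u in range(n) for v in range(u + 1, n)),
        reverse=True)
    parent = list(range(n))            # union-find forest over the variables

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for _, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                   # the edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v))
    return tree
\end{verbatim}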
