

Experiments

In this section we report the results of experiments comparing the speed of the ACCL algorithm with the standard CHOWLIU method on artificial data. In these experiments the binary domain dimension $n$ varies from $50$ to $1000$. Each data point has a fixed number $s$ of variables taking the value 1; the sparseness $s$ takes the values 5, 10, 15 and 100. The data were generated from an artificially constructed non-uniform, non-factored distribution. For each pair $(n,s)$ a set of 10,000 points was created, and both the CHOWLIU algorithm and the ACCL algorithm were used to fit a single tree distribution to each data set.

The running times are plotted in Figure 23. The improvements of ACCL over the standard version are spectacular: learning a tree over 1000 variables from 10,000 data points takes 4 hours with the standard algorithm but only 4 seconds with the accelerated version when the data are sparse ($s\leq 15$). In the less sparse regime of $s=100$, the ACCL algorithm completes in 2 minutes, improving on the traditional algorithm by a factor of 120. Note also that the running time of the accelerated algorithm is nearly independent of the dimension of the domain, whereas the number of steps $n_K$ (Figure 24) grows with $n$. This implies that the bulk of the computation lies in the steps preceding the Kruskal algorithm proper: most of the time is spent computing co-occurrences and organizing the data (the sketch below illustrates this computation on sparse data). Figure 23 also confirms that the running time of the traditional CHOWLIU algorithm grows quadratically with $n$ and is independent of $s$.

This concludes the presentation of the ACCL algorithm. The method achieves its performance gains by exploiting characteristics of the data (sparseness) and of the problem (the weights represent mutual information) that are external to the maximum weight spanning tree algorithm proper. The resulting algorithm is $\tilde{{\cal O}}(sn+s^2N+n^2)$ in the worst case but typically $\tilde{{\cal O}}(sn+s^2N)$, a significant asymptotic improvement over the ${\cal O}(n^2N)$ of the traditional Chow and Liu algorithm. For instance, at $n=1000$, $s=100$ and $N=10{,}000$ the dominant term $s^2N\approx 10^8$ is about two orders of magnitude below $n^2N=10^{10}$, in line with the observed speedup factor of roughly 120. Moreover, if $s$ is large, the ACCL algorithm degrades gracefully to the standard CHOWLIU algorithm. The algorithm extends to non-integer counts and is therefore directly applicable to mixtures of trees.

As we have seen empirically, a very significant part of the running time is spent computing co-occurrences. This prompts future work on learning statistical models over large domains that focuses on the efficient computation and use of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm [Friedman 1998; Friedman, Getoor 1999] as well as A-D trees [Moore, Lee 1998]. The latter are closely related to our representation of the pairwise marginals $P_{uv}$ by counts; in fact, our representation can be viewed as a ``reduced'' A-D tree that stores only pairwise statistics. Consequently, when an A-D tree representation has already been computed, it can be exploited in steps 1 and 2 of the ACCL algorithm. Other versions of the ACCL algorithm are discussed by THESIS.
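To make the role of sparseness concrete, the following Python sketch (not the implementation used in the experiments; the record format and names are illustrative) computes single-variable occurrence counts and pairwise co-occurrence counts, from which the pairwise marginals $P_{uv}$ can be obtained. It iterates only over the variables that are 1 in each data point, so its cost is ${\cal O}(s^2N)$ rather than ${\cal O}(n^2N)$.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(records):
    """Count single and pairwise occurrences for sparse binary data.

    Each record is the sorted list of variable indices that take the
    value 1 (at most s indices per record).  The cost is O(s^2) per
    record, i.e. O(s^2 N) overall, independent of the domain size n.
    """
    singles = Counter()   # occurrence counts of individual variables
    pairs = Counter()     # co-occurrence counts of pairs (u, v) with u < v
    for record in records:
        for u in record:
            singles[u] += 1
        for u, v in combinations(record, 2):
            pairs[(u, v)] += 1
    return singles, pairs

# Toy example: N = 3 records over a large domain, s = 3 ones per record.
records = [[2, 7, 40], [2, 40, 999], [7, 40, 512]]
singles, pairs = cooccurrence_counts(records)
print(singles[40], pairs[(2, 40)])   # -> 3 2
```

Note that only pairs of variables that actually co-occur appear in the dictionary of pairwise counts, which is in the spirit of the ``reduced'' A-D tree view mentioned above.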