
## Learning of tree distributions

The learning problem is formulated as follows: we are given a set of observations $\{x^1, \ldots, x^N\}$ and we are required to find the tree $T$ that maximizes the log likelihood of the data:

$$\max_T \sum_{i=1}^{N} \log T(x^i),$$

where $x^i$ is an assignment of values to all variables. Note that the maximum is taken both with respect to the tree structure (the choice of which edges to include) and with respect to the numerical values of the parameters. Here and in the rest of the paper we will assume for simplicity that there are no missing values for the variables in $V$, or, in other words, that the observations are complete. Letting $P(x)$ denote the proportion of observations in the training set that are equal to $x$, we can alternatively express the maximum likelihood problem by summing over configurations $x$:

$$\max_T \sum_{x} P(x) \log T(x).$$
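To make the equivalence of the two formulations concrete, the following sketch checks numerically that the per-observation log likelihood equals $N$ times the cross-entropy form. The dataset and the probability table for $T$ are invented for illustration; any valid distribution over the observed configurations would do.

```python
import math
from collections import Counter

# Toy observations over two binary variables (hypothetical data).
data = [(0, 0), (0, 1), (0, 1), (1, 1)]
N = len(data)

# A hypothetical distribution T over the four configurations.
T = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.1, (1, 1): 0.2}

# Per-observation form of the log likelihood: sum_i log T(x^i).
ll_obs = sum(math.log(T[x]) for x in data)

# Cross-entropy form: N * sum_x P(x) log T(x), with P the empirical
# proportion of each configuration x in the training set.
P = {x: c / N for x, c in Counter(data).items()}
ll_dist = N * sum(p * math.log(T[x]) for x, p in P.items())

assert abs(ll_obs - ll_dist) < 1e-9
```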
In this form we see that the log likelihood criterion function is a (negative) cross-entropy. We will in fact solve the problem in general, letting $P$ be an arbitrary probability distribution. This generality will prove useful in the following section, where we consider mixtures of trees. The solution to the learning problem is an algorithm, due to [Chow, Liu 1968], that has quadratic complexity in $n$ (see Figure 3). There are three steps to the algorithm. First, we compute the pairwise marginals $P_{uv}$. If $P$ is an empirical distribution, as in the present case, computing these marginals requires $O(n^2 N)$ operations. Second, from these marginals we compute the mutual information between each pair of variables $u, v$ in $V$ under the distribution $P$:

$$I_{uv} \;=\; \sum_{x_u, x_v} P_{uv}(x_u, x_v) \,\log \frac{P_{uv}(x_u, x_v)}{P_u(x_u)\,P_v(x_v)},$$

an operation that requires $O(n^2 r_{\max}^2)$ operations, where $r_{\max}$ is the maximum number of values taken by a variable. Third, we run a maximum-weight spanning tree (MWST) algorithm [Cormen, Leiserson, Rivest 1990], using $I_{uv}$ as the weight for edge $(u, v)$. Such algorithms, which run in time $O(n^2)$ on the complete graph, return a spanning tree that maximizes the total mutual information for edges included in the tree.

Chow and Liu showed that the maximum-weight spanning tree also maximizes the likelihood over tree distributions $T$, and moreover the optimizing parameters $T_{uv}$ (or $T_{v|u}$), for $(u, v) \in E_T$, are equal to the corresponding marginals of the distribution $P$:

$$T_{uv} \;=\; P_{uv} \quad \text{for all } (u, v) \in E_T.$$
The algorithm thus attains a global optimum over both structure and parameters.
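As a sketch of this parameter-fitting step, the snippet below fixes a hypothetical tree over three variables (the edges, root, and dataset are invented for illustration), sets the edge parameters to the empirical pairwise marginals as the result above prescribes, and checks that the resulting $T$ is a properly normalized distribution.

```python
from collections import Counter
from itertools import product

# Hypothetical learned tree over 3 variables: edges (0,1) and (0,2),
# rooted at variable 0, so pa(1) = pa(2) = 0.
data = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0)]
N = len(data)
parent = {1: 0, 2: 0}

# ML parameters are the empirical marginals: T_uv = P_uv on every edge.
pair = {(u, v): Counter((row[u], row[v]) for row in data)
        for v, u in parent.items()}
root_marg = Counter(row[0] for row in data)

def T(x):
    """Evaluate T(x) = T_0(x_0) * prod_v T(x_v | x_pa(v))."""
    p = root_marg[x[0]] / N
    for v, u in parent.items():
        # Conditional T(x_v | x_u) = P_uv(x_u, x_v) / P_u(x_u).
        p *= pair[(u, v)][(x[u], x[v])] / root_marg[x[u]]
    return p

# With parameters set to the marginals, T sums to one over all
# configurations, as a valid tree distribution must.
assert abs(sum(T(x) for x in product((0, 1), repeat=3)) - 1.0) < 1e-9
```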

Journal of Machine Learning Research 2000-10-19