

Learning of tree distributions

The learning problem is formulated as follows: we are given a set of observations ${\cal D}=\{x^1,  x^2,   \ldots,  x^N\}$ and we are required to find the tree $T^*$ that maximizes the log likelihood of the data:

\begin{displaymath}
T^*\;=\;\mathop{\rm argmax}_{T}\; \sum_{i=1}^N \log T(x^i),
\end{displaymath}

where $x^i$ is an assignment of values to all variables. Note that the maximum is taken both with respect to the tree structure (the choice of which edges to include) and with respect to the numerical values of the parameters. Here and in the rest of the paper we will assume for simplicity that there are no missing values for the variables in $V$, or, in other words, that the observations are complete. Letting $P(x)$ denote the proportion of observations $x^i$ in the training set ${\cal D}$ that are equal to $x$, we can alternatively express the maximum likelihood problem by summing over configurations $x$:

\begin{displaymath}
T^*\;=\;\mathop{\rm argmax}_{T}\; \sum_{x \in \Omega} P(x) \log T(x).
\end{displaymath}
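The two criteria are equivalent because each configuration $x \in \Omega$ occurs $N P(x)$ times in ${\cal D}$, so that

\begin{displaymath}
\sum_{i=1}^N \log T(x^i) \;=\; N \sum_{x\in\Omega} P(x) \log T(x).
\end{displaymath}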

In this form we see that the log likelihood criterion function is a (negative) cross-entropy. We will in fact solve the problem in general, letting $P(x)$ be an arbitrary probability distribution. This generality will prove useful in the following section, where we consider mixtures of trees. The solution to the learning problem is an algorithm, due to [Chow, Liu 1968], that has quadratic complexity in $n$ (see Figure 3). There are three steps to the algorithm. First, we compute the pairwise marginals $P_{uv}(x_u,x_v)=\sum_{x_{V-\{u,v\}}}P(x_u,x_v,x_{V-\{u,v\}})$. If $P$ is an empirical distribution, as in the present case, computing these marginals requires ${\cal O}(n^2N)$ operations. Second, from these marginals we compute the mutual information between each pair of variables in $V$ under the distribution $P$:

\begin{displaymath}
I_{uv} \;=\; \sum_{x_u x_v} P_{uv}(x_u,x_v)
\log\frac{P_{uv}(x_u,x_v)}{P_u(x_u)\,P_v(x_v)},
\;\;\;\;u,v\in V,\; u\neq v,
\end{displaymath}

a computation that requires ${\cal O}(n^2 r^2_{MAX})$ operations. Third, we run a maximum-weight spanning tree (MWST) algorithm [Cormen, Leiserson, Rivest 1990], using $I_{uv}$ as the weight for edge $(u,v)$, for all $u,v \in V$. Such algorithms, which run in time ${\cal O}(n^2)$, return a spanning tree that maximizes the total mutual information over the edges included in the tree.
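As a concrete illustration of these three steps, the following is a minimal Python sketch, assuming the complete observations are given as an $N\times n$ integer array; the function names and the choice of Prim's algorithm for the MWST step are our own illustrative assumptions, not part of the original presentation.

\begin{verbatim}
# Illustrative sketch of steps 1-3; array layout and names are assumptions.
import numpy as np

def mutual_information_weights(data, n_values):
    # data: (N, n) integer array of complete observations (values 0..r_v - 1);
    # n_values[v]: number of values taken by variable v.
    N, n = data.shape
    I = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            # Step 1: pairwise marginal P_uv as a normalized count table.
            P_uv = np.zeros((n_values[u], n_values[v]))
            for x in data:
                P_uv[x[u], x[v]] += 1.0
            P_uv /= N
            P_u, P_v = P_uv.sum(axis=1), P_uv.sum(axis=0)
            # Step 2: mutual information I_uv (terms with P_uv = 0 contribute 0).
            nz = P_uv > 0
            I[u, v] = I[v, u] = np.sum(P_uv[nz] *
                                       np.log(P_uv[nz] / np.outer(P_u, P_v)[nz]))
    return I

def max_weight_spanning_tree(I):
    # Step 3: naive Prim's algorithm on the complete graph with weights I_uv.
    # Returns directed edges (parent, child) of a spanning tree rooted at vertex 0.
    n = I.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        u, v = max(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: I[e[0], e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges
\end{verbatim}

The double loop over variable pairs reproduces the ${\cal O}(n^2 N)$ cost of the marginal computation; the inner maximization in the Prim sketch is deliberately naive, and maintaining a best-edge array per non-tree vertex would attain the ${\cal O}(n^2)$ bound cited above.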

Figure 3: The Chow and Liu algorithm for maximum likelihood estimation of tree structure and parameters.
\begin{figure}
\begin{tabular}{l}
\hline
\\
Algorithm {\sc ChowLiu}$(P)$ \\
\\
{\bf Output:} $T$ \\
\\
\hline
\end{tabular}
\end{figure}

Chow and Liu showed that the maximum-weight spanning tree also maximizes the likelihood over tree distributions $T$, and moreover that the optimizing parameters $T_{uv}$ (or $T_{u\vert v}$), for $(u,v)\in E_{T}$, are equal to the corresponding marginals $P_{uv}$ of the distribution $P$:

\begin{displaymath}
T_{uv}\;\equiv\;P_{uv}.
\end{displaymath}

The algorithm thus attains a global optimum over both structure and parameters.
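Continuing the illustrative sketch above, the final parameter-fitting step simply copies the pairwise marginals of $P$ onto the tree edges, and the resulting log likelihood can be evaluated with the directed factorization rooted at vertex 0 (the edge orientation produced by the Prim sketch). The function names and the usage example are again our own assumptions.

\begin{verbatim}
# Illustrative continuation: the parameters are the pairwise marginals, T_uv = P_uv.
import numpy as np

def fit_tree_parameters(data, edges, n_values):
    # Returns the edge marginals T_uv = P_uv and the single-variable marginals P_v.
    N, n = data.shape
    pair = {}
    for (u, v) in edges:
        P_uv = np.zeros((n_values[u], n_values[v]))
        for x in data:
            P_uv[x[u], x[v]] += 1.0
        pair[(u, v)] = P_uv / N
    single = [np.bincount(data[:, v], minlength=n_values[v]) / N for v in range(n)]
    return pair, single

def log_likelihood(data, edges, pair, single, root=0):
    # Directed factorization T(x) = P_root(x_root) * prod_{(u,v)} P_uv(x_u,x_v)/P_u(x_u),
    # with u the parent of v, as oriented by the spanning-tree sketch above.
    ll = 0.0
    for x in data:
        ll += np.log(single[root][x[root]])
        for (u, v) in edges:
            ll += np.log(pair[(u, v)][x[u], x[v]] / single[u][x[u]])
    return ll

# Example usage (hypothetical data and cardinalities):
#   I = mutual_information_weights(data, n_values)
#   edges = max_weight_spanning_tree(I)
#   pair, single = fit_tree_parameters(data, edges, n_values)
#   print(log_likelihood(data, edges, pair, single))
\end{verbatim}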