
Decomposable priors for tree parameters

The decomposable prior for parameters that we introduce is a Dirichlet prior [Heckerman, Geiger, Chickering 1995]. The Dirichlet distribution is defined over the domain $\theta_{1,\ldots,r}>0,\; \sum_j \theta_j = 1$ and has the form

\begin{displaymath}
D(\theta_{1,\ldots,r};N'_{1,\ldots,r})\;\propto\; \prod_{j=1}^r \theta_j^{N'_j-1}.
\end{displaymath}
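For concreteness, the following minimal Python sketch evaluates this density for a hypothetical choice of $r=3$; the numerical values, the variable names and the use of NumPy/SciPy are illustrative assumptions, not part of the model.

\begin{verbatim}
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

# Hypothetical example: r = 3 outcomes, Dirichlet parameters N'_j > 0.
N_prime = np.array([2.0, 1.0, 3.0])
theta   = np.array([0.2, 0.3, 0.5])    # a point in the simplex

# Unnormalized log-density from the formula: sum_j (N'_j - 1) log theta_j.
log_unnorm = np.sum((N_prime - 1.0) * np.log(theta))

# Normalizing constant: B(N') = prod_j Gamma(N'_j) / Gamma(sum_j N'_j).
log_B = np.sum(gammaln(N_prime)) - gammaln(N_prime.sum())

# The normalized density agrees with SciPy's Dirichlet implementation.
assert np.isclose(log_unnorm - log_B, dirichlet.logpdf(theta, N_prime))
\end{verbatim}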

The numbers $N'_{1,\ldots,r}>0$ that parametrize $D$ can be interpreted as the sufficient statistics of a ``fictitious data set'' of size $N'=\sum_j N'_j$; accordingly, the $N'_j$ are called fictitious counts, and $N'$ represents the strength of the prior. To specify a prior over tree parameters, one must define a Dirichlet distribution for each of the probability tables $T_{v\vert u=x_u}$, $\overline{uv}\in E^D$, for every possible tree structure $E^D$. This is achieved by means of a set of parameters $N'_{vux_vx_u}$ satisfying

\begin{displaymath}
\sum_{x_u} N'_{vux_vx_u}\;=\;N'_{vx_v}, \;\;\;\;\;\;
\sum_{x_v} N'_{vx_v}\;=\;N'\;\;\;\;\;\;{\rm for}\;{\rm all}\;u,v\in V.
\end{displaymath}
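These constraints hold automatically if the fictitious counts are obtained by scaling the pairwise and single-variable marginals of a single fictitious joint distribution $P'$ by $N'$. The following Python sketch assumes such a construction; the toy cardinalities, the random joint table and the helper names are our own illustration.

\begin{verbatim}
import numpy as np

# Hypothetical toy setting: n = 3 discrete variables with cardinalities r_v,
# and a fictitious joint distribution P' stored as a full table.
rng = np.random.default_rng(0)
r = [2, 3, 2]                      # cardinalities r_v
N_prime = 10.0                     # prior strength N'
joint = rng.random(r)
joint /= joint.sum()               # fictitious joint P'(x_1, x_2, x_3)

def pairwise_counts(u, v):
    """Fictitious counts N'_{vu x_v x_u}, stored as a table over (x_u, x_v)."""
    other = tuple(a for a in range(len(r)) if a not in (u, v))
    marg = joint.sum(axis=other)   # remaining axes keep increasing index order
    return N_prime * (marg if u < v else marg.T)

def single_counts(v):
    """Fictitious counts N'_{v x_v}."""
    other = tuple(a for a in range(len(r)) if a != v)
    return N_prime * joint.sum(axis=other)

# Check the constraints: summing the pairwise counts over x_u recovers the
# single-variable counts N'_{v x_v}, which in turn sum to N', for all u, v.
for u in range(len(r)):
    for v in range(len(r)):
        if u != v:
            assert np.allclose(pairwise_counts(u, v).sum(axis=0), single_counts(v))
            assert np.isclose(single_counts(v).sum(), N_prime)
\end{verbatim}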

With these settings, the prior for the parameters $T_{v\vert u}(x_v\vert x_u),\; x_v=1,\ldots, r_v$ in any tree that contains the directed edge $\overline{uv}$ is defined by $N'_{uvx_ux_v},\; x_v=1,\ldots, r_v$. This representation of the prior is not only compact (order $n^2r_{MAX}^2$ parameters) but also consistent: two different directed parametrizations of the same tree distribution receive the same prior. The assumptions that allow us to define this prior are made explicit in [Meilă, Jaakkola 2000] and parallel the reasoning of [Heckerman, Geiger, Chickering 1995] for general Bayes nets. Denote by $P$ the empirical distribution obtained from a data set of size $N$ and by $P'_{uv}(x_u x_v)=N'_{uvx_ux_v}/N'$ the distribution defined by the fictitious counts. Then, by a property of the Dirichlet distribution [Heckerman, Geiger, Chickering 1995], it follows that learning a MAP tree is equivalent to learning an ML tree for the weighted combination $\tilde{P}$ of the two ``datasets''
\begin{displaymath}
\tilde{P}\;=\;\frac{1}{N+N'}(N'P' + NP). \eqno{(10)}
\end{displaymath}

Consequently, the parameters of the optimal tree are $T_{uv}\;=\;\tilde{P}_{uv}$. For a mixture of trees, maximizing the posterior translates into replacing $P$ by $P^k$ and $N$ by $\Gamma_k$ in equation (10) above. This implies that the M step of the EM algorithm, as well as the E step, is exact and tractable in the case of MAP estimation with decomposable priors. Finally, note that the posteriors $Pr[Q\vert{\cal D}]$ for models with different $m$ are defined only up to a constant that depends on $m$; hence, one cannot compare posteriors of MTs with different numbers of mixture components $m$. In the experiments we present, we therefore choose $m$ via other performance criteria: validation set likelihood in the density estimation experiments and validation set classification accuracy in the classification tasks.
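To make equation (10) and the substitution for mixtures concrete, the sketch below computes the M-step target marginal $\tilde{P}^k_{uv}$ for one edge and one mixture component. The binary variables, the random toy data, the stand-in E-step posteriors (with $\Gamma_k$ taken to be their sum over data points) and the uniform fictitious marginal $P'_{uv}$ are illustrative assumptions; this is a sketch of the update, not the authors' implementation.

\begin{verbatim}
import numpy as np

# Hypothetical toy data: N samples over n binary variables, plus E-step
# posteriors gamma[i, k] = Pr[component k | data point i].
rng = np.random.default_rng(1)
N, n, m = 200, 4, 2
data = rng.integers(0, 2, size=(N, n))
gamma = rng.dirichlet(np.ones(m), size=N)   # stand-in for the E-step output

N_prime = 5.0                               # prior strength N'
P_prime = np.full((2, 2), 0.25)             # fictitious pairwise marginal P'_{uv}
                                            # (uniform, for every edge, as an example)

def map_pairwise_marginal(u, v, k):
    """M-step target for edge (u, v) in component k:
       tilde{P}^k_{uv} = (N' P'_{uv} + Gamma_k P^k_{uv}) / (N' + Gamma_k),
       i.e. equation (10) with P -> P^k and N -> Gamma_k."""
    Gamma_k = gamma[:, k].sum()
    Pk_uv = np.zeros((2, 2))                # posterior-weighted empirical marginal
    for i in range(N):
        Pk_uv[data[i, u], data[i, v]] += gamma[i, k]
    Pk_uv /= Gamma_k
    return (N_prime * P_prime + Gamma_k * Pk_uv) / (N_prime + Gamma_k)

# The MAP parameters of a tree containing the directed edge (u, v) are the
# conditionals of this marginal: T_{v|u}(x_v | x_u).
tP = map_pairwise_marginal(0, 1, 0)
T_v_given_u = tP / tP.sum(axis=1, keepdims=True)
\end{verbatim}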