
Decomposable priors for tree parameters

The decomposable prior for parameters that we introduce is a Dirichlet prior [Heckerman, Geiger, Chickering 1995]. The Dirichlet distribution is defined over the domain $\theta_{1,\ldots,r}>0,\; \sum_j \theta_j = 1$ and has the form

\begin{displaymath}
D(\theta_{1,\ldots,r};N'_{1,\ldots,r})\;\propto\; \prod_{j=1}^r \theta_j^{N'_j-1}.
\end{displaymath}
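For concreteness, the following minimal Python sketch evaluates this density for a hypothetical choice of $r=3$; the numerical values, the variable names and the use of NumPy/SciPy are illustrative assumptions, not part of the model.

\begin{verbatim}
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

# Hypothetical example: r = 3 outcomes, Dirichlet parameters N'_j > 0.
N_prime = np.array([2.0, 1.0, 3.0])
theta   = np.array([0.2, 0.3, 0.5])    # a point in the simplex

# Unnormalized log-density from the formula: sum_j (N'_j - 1) log theta_j.
log_unnorm = np.sum((N_prime - 1.0) * np.log(theta))

# Normalizing constant: B(N') = prod_j Gamma(N'_j) / Gamma(sum_j N'_j).
log_B = np.sum(gammaln(N_prime)) - gammaln(N_prime.sum())

# The normalized density agrees with SciPy's Dirichlet implementation.
assert np.isclose(log_unnorm - log_B, dirichlet.logpdf(theta, N_prime))
\end{verbatim}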

The numbers $N'_{1,\ldots,r}>0$ that parametrize $D$ can be interpreted as the sufficient statistics of a ``fictitious data set'' of size $N'=\sum_j N'_j$; accordingly, the $N'_j$ are called fictitious counts, and $N'$ represents the strength of the prior. To specify a prior over tree parameters, one must define a Dirichlet distribution for each of the probability tables $T_{v\vert u=x_u}$, $\overline{uv}\in E^D$, for every possible tree structure $E^D$. This is achieved by means of a set of parameters $N'_{vux_vx_u}$ satisfying

\begin{displaymath}
\sum_{x_u} N'_{vux_vx_u}\;=\;N'_{vx_v}, \;\;\;\;\;\;
\sum_{x_v} N'_{vx_v}\;=\;N'\;\;\;\;\;\;{\rm for}\;{\rm all}\;u,v\in V.
\end{displaymath}
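These constraints hold automatically if the fictitious counts are obtained by scaling the pairwise and single-variable marginals of a single fictitious joint distribution $P'$ by $N'$. The following Python sketch assumes such a construction; the toy cardinalities, the random joint table and the helper names are our own illustration.

\begin{verbatim}
import numpy as np

# Hypothetical toy setting: n = 3 discrete variables with cardinalities r_v,
# and a fictitious joint distribution P' stored as a full table.
rng = np.random.default_rng(0)
r = [2, 3, 2]                      # cardinalities r_v
N_prime = 10.0                     # prior strength N'
joint = rng.random(r)
joint /= joint.sum()               # fictitious joint P'(x_1, x_2, x_3)

def pairwise_counts(u, v):
    """Fictitious counts N'_{vu x_v x_u}, stored as a table over (x_u, x_v)."""
    other = tuple(a for a in range(len(r)) if a not in (u, v))
    marg = joint.sum(axis=other)   # remaining axes keep increasing index order
    return N_prime * (marg if u < v else marg.T)

def single_counts(v):
    """Fictitious counts N'_{v x_v}."""
    other = tuple(a for a in range(len(r)) if a != v)
    return N_prime * joint.sum(axis=other)

# Check the constraints: summing the pairwise counts over x_u recovers the
# single-variable counts N'_{v x_v}, which in turn sum to N', for all u, v.
for u in range(len(r)):
    for v in range(len(r)):
        if u != v:
            assert np.allclose(pairwise_counts(u, v).sum(axis=0), single_counts(v))
            assert np.isclose(single_counts(v).sum(), N_prime)
\end{verbatim}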

With these settings, the prior for the parameters $T_{v\vert u}(x_v\vert x_u),\; x_v=1,\ldots, r_v$ in any tree that contains the directed edge $\overline{uv}$ is defined by $N'_{uvx_ux_v},\; x_v=1,\ldots, r_v$. This representation of the prior is not only compact (order $n^2r_{MAX}^2$ parameters) but also consistent: two different directed parametrizations of the same tree distribution receive the same prior. The assumptions that allow us to define this prior are made explicit in [Meilă, Jaakkola 2000] and parallel the reasoning of [Heckerman, Geiger, Chickering 1995] for general Bayes nets. Denote by $P$ the empirical distribution obtained from a data set of size $N$ and by $P'_{uv}(x_u x_v)=N'_{uvx_ux_v}/N'$ the distribution defined by the fictitious counts. Then, by a property of the Dirichlet distribution [Heckerman, Geiger, Chickering 1995], it follows that learning a MAP tree is equivalent to learning an ML tree for the weighted combination $\tilde{P}$ of the two ``datasets''
\begin{displaymath}
\tilde{P}\;=\;\frac{1}{N+N'}(N'P' + NP). \eqno{(10)}
\end{displaymath}

Consequently, the parameters of the optimal tree are $T_{uv}\;=\;\tilde{P}_{uv}$. For a mixture of trees, maximizing the posterior translates into replacing $P$ by $P^k$ and $N$ by $\Gamma_k$ in equation (10) above. This implies that the M step of the EM algorithm, as well as the E step, is exact and tractable in the case of MAP estimation with decomposable priors. Finally, note that the posteriors $Pr[Q\vert{\cal D}]$ for models with different $m$ are defined only up to a constant that depends on $m$; hence, one cannot compare posteriors of MTs with different numbers of mixture components $m$. In the experiments we present, we therefore choose $m$ via other performance criteria: validation set likelihood in the density estimation experiments and validation set classification accuracy in the classification tasks.
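To make equation (10) and the substitution for mixtures concrete, the sketch below computes the M-step target marginal $\tilde{P}^k_{uv}$ for one edge and one mixture component. The binary variables, the random toy data, the stand-in E-step posteriors (with $\Gamma_k$ taken to be their sum over data points) and the uniform fictitious marginal $P'_{uv}$ are illustrative assumptions; this is a sketch of the update, not the authors' implementation.

\begin{verbatim}
import numpy as np

# Hypothetical toy data: N samples over n binary variables, plus E-step
# posteriors gamma[i, k] = Pr[component k | data point i].
rng = np.random.default_rng(1)
N, n, m = 200, 4, 2
data = rng.integers(0, 2, size=(N, n))
gamma = rng.dirichlet(np.ones(m), size=N)   # stand-in for the E-step output

N_prime = 5.0                               # prior strength N'
P_prime = np.full((2, 2), 0.25)             # fictitious pairwise marginal P'_{uv}
                                            # (uniform, for every edge, as an example)

def map_pairwise_marginal(u, v, k):
    """M-step target for edge (u, v) in component k:
       tilde{P}^k_{uv} = (N' P'_{uv} + Gamma_k P^k_{uv}) / (N' + Gamma_k),
       i.e. equation (10) with P -> P^k and N -> Gamma_k."""
    Gamma_k = gamma[:, k].sum()
    Pk_uv = np.zeros((2, 2))                # posterior-weighted empirical marginal
    for i in range(N):
        Pk_uv[data[i, u], data[i, v]] += gamma[i, k]
    Pk_uv /= Gamma_k
    return (N_prime * P_prime + Gamma_k * Pk_uv) / (N_prime + Gamma_k)

# The MAP parameters of a tree containing the directed edge (u, v) are the
# conditionals of this marginal: T_{v|u}(x_v | x_u).
tP = map_pairwise_marginal(0, 1, 0)
T_v_given_u = tP / tP.sum(axis=1, keepdims=True)
\end{verbatim}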