next up previous
Next: Marginalization, inference and sampling Up: Learning with Mixtures of Previous: Related work

Tree distributions

In this section we introduce the tree model and the notation that will be used throughout the paper. Let $V$ denote a set of $n$ discrete random variables of interest. For each random variable $v\in V$ let $\Omega(v)$ represent its range, $x_v \in\Omega(v)$ a particular value, and $r_v$ the (finite) cardinality of $\Omega(v)$. For each subset $A$ of $V$, let $\Omega(A) = \bigotimes_{v\in A} \Omega(v)$ and let $x_A$ denote an assignment to the variables in $A$. To simplify notation $x_V$ will be denoted by $x$ and $\Omega(V)$ will be denoted simply by $\Omega$. Sometimes we need to refer to the maximum of $r_v$ over $V$; we denote this value by $r_{\it max}$. We begin with undirected (Markov random field) representations of tree distributions. Identifying the vertex set of a graph with the set of random variables $V$, consider a graph $G = (V,E)$, where $E$ is a set of undirected edges. We allow a tree to have multiple connected components (thus our ``trees'' are generally called forests). Given this definition, the number of edges $\vert E\vert$ and the number of connected components $p$ are related as follows:

\vert E\vert+p \; =\; \vert V\vert,

implying that adding an edge to a tree reduces the number of connected components by 1. Thus, a tree can have at most $\vert V\vert-1=n-1$ edges. In this latter case we refer to the tree as a spanning tree. We parameterize a tree in the following way. For $u, v \in V$ and $(u,v)\in E$, let $T_{uv}$ denote a joint probability distribution on $u$ and $v$. We require these distributions to be consistent with respect to marginalization, denoting by $T_u(x_u)$ the marginal of $T_{uv}(x_u,x_v)$ or $T_{vu}(x_v,x_u)$ with respect to $x_v$ for any $v \neq u$. We now assign a distribution $T$ to the graph $(V,E)$ as follows:
T(x) \;=\; \frac{\prod_{(u,v)\in E}T_{uv}(x_u,x_v)}{\prod_{v\in
V}T_{v}(x_v)^{{\rm deg} v-1}},
\end{displaymath} (1)

where deg$  v$ is the degree of vertex $v$; i.e., the number of edges incident to $v\in V$. It can be verified that $T$ is in fact a probability distribution; moreover, the pairwise probabilities $T_{uv}$ are the marginals of $T$. A tree distribution $T$ is defined to be any distribution that admits a factorization of the form (1). Tree distributions can also be represented using directed (Bayesian network) graphical models. Let $G^D = (V,E^D)$ denote a directed tree (possibly a forest), where $E^D$ is a set of directed edges and where each node $v$ has (at most) one parent, denoted pa$(v)$. We parameterize this graph as follows:
T(x) \;=\;\prod_{v \in V} T_{v\vert{\rm pa(}v)}(x_v\vert x_{{\rm pa(}v)})
\end{displaymath} (2)

where $T_{v\vert{\rm pa(}v)}(x_v\vert x_{{\rm pa(}v)})$ is an arbitrary conditional distribution. It can be verified that $T$ indeed defines a probability distribution; moreover, the marginal conditionals of $T$ are given by the conditionals $T_{v\vert{\rm pa(}v)}$. We shall call the representations (1) and (2) the undirected and directed tree representations of the distribution $T$ respectively. We can readily convert between these representations; for example, to convert (1) to a directed representation we choose an arbitrary root in each connected component and direct each edge away from the root. For $(u,v)\in E$ with $u$ closer to the root than $v$, let pa$(v) = u$. Now compute the conditional probabilities corresponding to each directed edge by recursively substituting ${T_{v{\rm pa}(v)}}/{T_{{\rm pa}(v)}}$ by $T_{v\vert{\rm pa}(v)}$ starting from the root. Figure 2 illustrates this process on a tree with 5 vertices.

Figure 2: A tree in its undirected (a) and directed (b) representations.
...{2\vert 3}( x_{2}\vert x_{ 3 } ) $\\
(a) & (b)

The directed tree representation has the advantage of having independent parameters. The total number of free parameters in either representation is:
\sum_{(u,v) \in E}\!\!\!\! r_ur_v - \sum_{v \in V}({\rm ...
...m_{(u,v) \in E}\!\!\!(r_u-1)(r_v-1)+ \sum_{v \in V}r_v -   n
\end{displaymath} (3)

The right-hand side of (3) shows that each edge $(u,v)$ increases the number of parameters by $(r_u - 1)(r_v-1)$. The set of conditional independencies associated with a tree distribution are readily characterized [Lauritzen, Dawid, Larsen, Leimer 1990]. In particular, two subsets $A,B \subset V$ are independent given $C\subset V$ if $C$ intersects every path (ignoring the direction of edges in the directed case) between $u$ and $v$ for all $u\in A$ and $v\in B$.

next up previous
Next: Marginalization, inference and sampling Up: Learning with Mixtures of Previous: Related work
Journal of Machine Learning Research 2000-10-19