next up previous
Next: General Dependency Networks Up: Consistent Dependency Networks Previous: Definition and Basic Properties

Probabilistic Inference

Given a graphical model for ${\bf X}$, we often wish to answer probabilistic queries of the form $p({\bf y}\vert{\bf z})$, where ${\bf Y}$ (the ``target'' variables) and ${\bf Z}$ (the ``input'' variables) are disjoint subsets of ${\bf X}$. This task is known in the graphical modeling community as probabilistic inference. An important special case of probabilistic inference is the determination of $p({\bf x})$ for given instances ${\bf x}$ of ${\bf X}$--a key component of density estimation, which has numerous applications. Given a consistent dependency network for ${\bf X}$, we can perform probabilistic inference by converting it to a Markov network, triangulating that network (forming a decomposable graphical model), and then applying one of the standard algorithms for probabilistic inference in the latter representation--for example, the junction tree algorithm of Jensen, Lauritzen, and Olesen (1990). Alternatively, we can use Gibbs sampling (e.g., Geman and Geman, 1984; Neal, 1993; Besag, Green, Higdon, and Mengersen, 1995; Gilks, Richardson, and Spiegelhalter, 1996), which we examine in some detail. First, let us consider the use of Gibbs sampling for recovering the joint distribution $p({\bf x})$ of a consistent dependency network for ${\bf X}$. In one simple version of Gibbs sampling, we initialize each variable to some arbitrary value. We then repeatedly cycle through each variable $X_1, \ldots, X_n$, in this order, and resample each $X_i$ according to $p(x_i\vert{\bf x}\setminus x_i)=p(x_i\vert{\bf pa}_i)$. We call this procedure an ordered Gibbs sampler. As described by the following theorem, this ordered Gibbs sampler recovers the joint distribution for ${\bf X}$.

Theorem 2: An ordered Gibbs sampler applied to a consistent dependency network for ${\bf X}$, where each $X_i$ is finite (and hence discrete) and each local distribution $p(x_i\vert{\bf pa}_i)$ is positive, defines a Markov chain with a unique stationary joint distribution for ${\bf X}$ equal to $p({\bf X})$ that can be reached from any initial state of the chain.

Proof: Let ${\bf x}^t$ be the sample of ${\bf x}$ after the $t^{th}$ iteration of the ordered Gibbs sampler. The sequence ${\bf x}^1,{\bf x}^2,\ldots$ can be viewed as samples drawn from a homogenous Markov chain with transition matrix ${\bf P}$ having elements ${\bf P}_{ij}=
p({\bf x}^{t+1}=j\vert{\bf x}^{t}=i)$. The matrix ${\bf P}$ is the product ${\bf P}^1
\cdot {\bf P}^2 \cdot \ldots \cdot {\bf P}^n$, where ${\bf P}^k$ is the ``local'' transition matrix describing the resampling of $X_k$ according to the local distribution $p(x_k\vert{\bf pa}_k)$. The positivity of local distributions guarantees the positivity of ${\bf P}$, which in turn guarantees the irreducibility of the Markov chain. Consequently, there exists a unique joint distribution that is stationary with respect to ${\bf P}$. The positivity of ${\bf P}$ also guarantees that this stationary distribution can be reached from any starting point. That this stationary distribution is equal to $p({\bf x})$ is proved in the Appendix. $\Box$

After a sufficient number of iterations, the samples in the chain will be drawn from the stationary distribution for ${\bf X}$. We use these samples to estimate $p({\bf X})$. There is much written about how long to run the chain before keeping samples (the ``burn in'') and how many samples to use to obtain a good estimate (e.g., Gilks, Richardson, and Spiegelhalter, 1996). We do not discuss the details here. Next, let us consider the use of Gibbs sampling to compute $p({\bf y}\vert{\bf z})$ for particular instances ${\bf y}$ and ${\bf z}$ where ${\bf Y}$ and ${\bf Z}$ are arbitrary disjoint subsets of ${\bf X}$. A naive approach uses an ordered Gibbs sampler directly. During the iterations, only samples of ${\bf X}$ where ${\bf Z}={\bf z}$ are used to compute the conditional probability. An important difficulty with this approach is that if either $p({\bf y}\vert{\bf z})$ or $p({\bf z})$ is small (a common occurrence when ${\bf Y}$ or ${\bf Z}$ contain many variables), many iterations are required for an accurate probability estimate. A well-known approach for estimating $p({\bf y}\vert{\bf z})$ when $p({\bf z})$ is small is to fix ${\bf Z}={\bf z}$ during ordered Gibbs sampling. It is not difficult to generalize the proof of Theorem 2 to show that this modified ordered Gibbs sampler has a stationary distribution and yields the correct conditional probability given a consistent dependency network. When ${\bf y}$ is rare because ${\bf Y}$ contains many variables, we can use the independencies encoded in a dependency network along with the law of total probability to decompose the inference task into a set of inference tasks on single variables. For example, consider the dependency network $\left[ X_1 \ \ X_2 \leftrightarrow
X_3 \right]$. Given the independencies specified by this network, we have

p(x_1,x_2,x_3) = p(x_1) \ p(x_2) \ p(x_3\vert x_2),

and can obtain $p(x_1,x_2,x_3)$ by computing each term separately. The determination of the first term requires no Gibbs sampling--the distribution can be read directly from the local distribution for $X_1$ in the dependency network. The second two terms can be determined using modified ordered Gibbs samplers each with a singleton target ($x_2$ and $x_3$, respectively). Note that this approach has the additional advantage that some terms may be obtained by direct lookup, thereby avoiding some Gibbs sampling. In general, given a consistent dependency network for ${\bf X}$ and disjoint sets of variables ${\bf Y}\subseteq {\bf X}$ and ${\bf Z}\subseteq
{\bf X}$, we can obtain $p({\bf y}\vert{\bf z})$ for a particular instance of ${\bf y}$ and ${\bf z}$ as follows:

Algorithm 1: 

1 ${\bf U}:= {\bf Y}$ (* the unprocessed variables *)
2 ${\bf P}:= {\bf Z}$ (* the processed and conditioning variables *)
3 ${\bf p}:= {\bf z}$ (* the values for ${\bf P}$ *)
4 While ${\bf U}\neq \emptyset$
5 Choose $X_i \in {\bf U}$ such that $X_i$ has no more parents in ${\bf U}$ than any variable in ${\bf U}$
6 If all the parents of $X$ are in ${\bf P}$
7 $p(x_i\vert{\bf p}) := p(x_i\vert{\bf pa}_i)$
8 Else
9 Use a modified ordered Gibbs sampler to determine $p(x_i\vert{\bf p})$
10 ${\bf U}:= {\bf U}- X_i$
11 ${\bf P}:= {\bf P}+ X_i$
12 ${\bf p}:= {\bf p}+ x_i$
13 Return the product of the conditionals $p(x_i\vert{\bf p})$

The key step in this algorithm is step 7, which bypasses Gibbs sampling when it is not needed. This step is justified by Equation 1.

next up previous
Next: General Dependency Networks Up: Consistent Dependency Networks Previous: Definition and Basic Properties
Journal of Machine Learning Research, 2000-10-19