Near Consistency: Theoretical Considerations

Next: Related Work Up: General Dependency Networks Previous: Probabilistic Inference With Real

Near Consistency: Theoretical Considerations

Our concerns of the previous section were motivated by the possibility that small deviations in the local distributions of a dependency network could be amplified by the process of pseudo-Gibbs sampling. In this section, we examine this concern more directly and from a theoretical perspective. Consider two dependency networks: one learned from data and another that encodes the true distribution from which the data was sampled. Note that it is always possible to find a dependency network that encodes the true distribution--for example, a fully-connected dependency network can encode any joint distribution. Let $\tilde{\bf P}$ and ${\bf P}$ denote the transition matrices of the Markov chains defined by the learned and truth-encoding dependency networks, respectively. In addition, let $\tilde{\bf\pi}=(\tilde{\pi}_1,\ldots,\tilde{\pi}_k)$ and ${\bf\pi}=(\pi_1,\ldots,\pi_k)$ denote the stationary distributions corresponding to $\tilde{\bf P}$ and ${\bf P}$ , respectively. Our concern about sensitivity to deviations in local distributions can now be phrased as follows: If $\tilde{\bf P}$ is close to ${\bf P}$ , will $\tilde{\bf\pi}$ be close ${\bf\pi}$ ? There is a large literature in an area generally known as perturbation theory of stochastic matrices that provides answers to this question for various notions of ``close''. Answers usually are of the form

$\displaystyle \Vert \tilde{\bf\pi}- {\bf\pi}\Vert \leq \kappa \ \Vert {\bf E}\Vert, \ \ {\rm or}$			(4)
$\displaystyle \vert \tilde{\pi}_j - \pi_j \vert \leq \kappa_j \ \Vert {\bf E}\Vert, \ \ {\rm or}$			(5)
$\displaystyle \left\vert \frac{\tilde{\pi}_j - \pi_j}{\pi_j} \right\vert \leq \kappa_j \ \Vert {\bf E}\Vert,$			(6)

where ${\bf E}= \tilde{\bf P}- {\bf P}$ , $\Vert\cdot\Vert$ is some norm, and $\kappa$ or $\kappa_j$ are measures of sensitivity called condition numbers. When working with probabilities, where relative as opposed to absolute errors are often important, bounds of the form shown in Equation 6 are particularly useful. Here, we cite a potentially useful bound of this form given by Cho and Meyer (1999). These authors provide references to many other bounds as well.

Theorem (Cho and Meyer, 1999): Let ${\bf P}$ and $\tilde{\bf P}={\bf P}+{\bf E}$ be transition matrices for two homogenous, irreducible -state Markov chains with respective stationary distributions ${\bf\pi}$ and $\tilde{\bf\pi}$ . Let $\Vert E \Vert _{\infty}$ denote the infinity-norm of E, the maximum over $j=1,\ldots,k$ of the column sums $\sum_{i=1}^k \vert{\bf E}_{ij}\vert$ . Let $M_{ij}$ denote the mean first passage time from the $i^{th}$ state to the $j^{th}$ state in the chain corresponding to ${\bf P}$ --that is, the expected number of transitions to move from state to state . Then, the relative change in the $j^{th}$ stationary probability $\pi_j$ is given by

$\begin{displaymath} \frac{\vert\tilde{\pi}_j - \pi_j\vert}{\pi_j} \leq \frac{1}{2} \ \Vert E \Vert _{\infty} \ \max_{i \neq j} M_{ij}. \end{displaymath}$

(7)

Their bound is tight in the sense that, for any ${\bf P}$ satisfying the conditions of the theorem, there exists a perturbation ${\bf E}\neq {\bf0}$ such that the inequality in Equation 7 can be replaced with an equality. Nonetheless, when applied to dependency networks, the bound is typically not tight. To illustrate this point, we computed each term in Equation 7 for a transition matrix ${\bf P}$ corresponding to the Bayesian network learned from 790 cases from the WAM data set, and a $\tilde{\bf P}$ corresponding to a dependency network learned from a random sample of 790 cases generated from that Bayesian network. For , the state corresponding to $X_1=\cdots=X_6=0$ in the WAM domain, we obtain the inequality

$\begin{displaymath} \begin{tabular}{ccccccccc} $\frac{\vert\pi_j - \tilde{\pi}... ...& $\times$\ & 0.84 & $\times$\ & 40 & = & 16.8 \end{tabular} \end{displaymath}$

which is far from tight. Nonetheless, the appearance of mixing time in Equation 7 is interesting. It suggests that that chains with good convergence properties will be insensitive to perturbations in the transition matrix. In particular, it suggests that the joint distribution defined by a dependency network is more likely to be insensitive to errors in the local distributions of the network precisely when Gibbs sampling is effective. Hopefully, additional research will produce tighter bounds that better characterize those situations in which dependency networks can be used safely.

Next: Related Work Up: General Dependency Networks Previous: Probabilistic Inference With Real

Journal of Machine Learning Research, 2000-10-19