Probabilistic Inference With Real Data

where n is the number of variables in X. This score can be thought of as the average number of bits needed to encode the observation of a single variable in the test set. Note that we measure how well a dependency network predicts an entire case. We could instead look at predictions of particular conditional probabilities, but because Algorithm 1 uses products of such queries to produce a prediction for a full case, we expect comparisons on individual queries to be similar. In our experiments, we determine the probabilities in Equation 3 from a learned dependency network using Algorithm 1. For each pseudo-Gibbs sampler invoked in Algorithm 1, we average 5000 iterations after a 10-iteration burn-in. For each data set, these Gibbs-sampling parameters yield scores with a range of variation of less than 0.1% across 10 runs starting from different (random) initial states. For comparison, we measure the accuracy of two additional model classes: (1) a Bayesian network, and (2) a Bayesian network with no arcs--a baseline model (BL) in which the variables are mutually independent.
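As an illustration, the score just described can be computed as follows. This is a minimal sketch assuming the model supplies a probability for each full test case; the function name and interface are ours, not the paper's:

```python
import math

def avg_log_score(case_probs, n_vars):
    """Average number of bits per variable needed to encode the test set.

    case_probs: model probabilities p(x) for each full test case x.
    n_vars: number of variables n in each case.
    """
    # Total code length in bits for the whole test set ...
    total_bits = -sum(math.log2(p) for p in case_probs)
    # ... averaged over all variable observations (N cases * n variables).
    return total_bits / (len(case_probs) * n_vars)
```

For example, two 2-variable cases each assigned probability 1/4 cost exactly one bit per variable. Lower scores are better, with 0 meaning perfect prediction.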

| Dataset | Sewall/Shah | WAM | Digit2 | Digit6 | Nielsen |
|---|---|---|---|---|---|
| Number of variables | 5 | 6 | 64 | 64 | 203 |
| Training cases | 9286 | 790 | 700 | 700 | 1637 |
| Test cases | 1032 | 400 | 399 | 400 | 1637 |
| Score(BN) | 1.274 | 0.907 | 0.542 | 0.422 | 0.188 |
| Score(DN) | 1.277 | 0.911 | 0.584 | 0.454 | 0.189 |
| Score(DN)-Score(BN) | 0.002 | 0.004 | 0.042 | 0.033 | 0.001 |
| Score(BL) | 1.382 | 0.930 | 0.823 | 0.752 | 0.231 |
| Score(BL)-Score(BN) | 0.107 | 0.022 | 0.281 | 0.330 | 0.044 |
| 2^(Score(DN)-Score(BN)) | 1.00 | 1.00 | 1.03 | 1.02 | 1.00 |
| (Score(DN)-Score(BN)) / (Score(BL)-Score(BN)) | 0.02 | 0.16 | 0.15 | 0.10 | 0.02 |
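The two derived rows in the table follow from the score-difference rows: because scores are in bits, 2 raised to the score difference is the per-variable probability ratio between the two models, and dividing the DN-BN difference by the BL-BN difference expresses the dependency network's loss as a fraction of the baseline's. A sketch of the arithmetic (the published rows were presumably computed from unrounded scores, so recomputation from the rounded entries can differ in the last digit):

```python
# Recompute the derived rows from the rounded score-difference entries.
datasets = ["Sewall/Shah", "WAM", "Digit2", "Digit6", "Nielsen"]
diff_dn = [0.002, 0.004, 0.042, 0.033, 0.001]   # Score(DN) - Score(BN)
diff_bl = [0.107, 0.022, 0.281, 0.330, 0.044]   # Score(BL) - Score(BN)

derived = {}
for name, dn, bl in zip(datasets, diff_dn, diff_bl):
    ratio = 2 ** dn    # per-variable probability ratio (scores are in bits)
    rel = dn / bl      # fraction of the baseline's excess loss
    derived[name] = (round(ratio, 2), round(rel, 2))
# Note: for WAM this yields 0.18 rather than the published 0.16,
# presumably because the table's value used unrounded scores.
```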

That dependency networks are (slightly) less accurate than Bayesian networks is not surprising. In each domain, the number of parameters in the Bayesian network is smaller than the number of parameters in the corresponding dependency network. Consequently, the dependency-network estimates should have higher variance. This explanation is consistent with the observation that, roughly, the differences in accuracy are larger for the data sets with smaller sample sizes. Because dependency networks produce joint probabilities via pseudo-Gibbs sampling whereas Bayesian networks produce joint probabilities via multiplication, non-convergence of the sampler is another possible explanation for the greater accuracy of Bayesian networks. The small variance of the pseudo-Gibbs-sampler estimates across multiple runs, however, makes this explanation unlikely. Overall, our experiments suggest that Algorithm 1 applied to inconsistent dependency networks learned from data yields accurate joint probabilities.

Finally, let us consider issues of computation. Joint estimates produced by dependency networks require far more computation than those produced by Bayesian networks. For example, on a 600 MHz Pentium III with 128 MB of memory running the Windows 2000 operating system, the determination of p(x) for a case in the Nielsen dataset takes on average 2.0 seconds for the dependency network and 0.0006 seconds for the Bayesian network. Consequently, one should use a Bayesian network in those situations where it is known in advance that joint probabilities are needed. Nonetheless, in situations where general probabilistic inference is needed, exact inference in a Bayesian network is often intractable, and practitioners often turn to Gibbs sampling for inference. In such situations, Bayesian networks afford little computational advantage.
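To make the computational cost concrete, the per-case estimate can be sketched as follows. This is a minimal, binary-variable sketch in the spirit of Algorithm 1 (the chain rule combined with pseudo-Gibbs sampling, with burn-in and averaging as in the experiments), not the paper's implementation; the `conditionals` argument is our assumed interface to the dependency network's local distributions p(x_i | x_{-i}):

```python
import random

def estimate_case_prob(case, conditionals, n_iter=5000, burn_in=10, rng=random):
    """Estimate p(x) for a full case x via the chain rule,
    p(x) = prod_i p(x_i | x_1, ..., x_{i-1}),
    with each factor estimated by pseudo-Gibbs sampling.

    case: list of 0/1 values, one per variable.
    conditionals: conditionals[j](state) -> p(x_j = 1 | x_{-j}), the
        dependency network's local model for variable j (binary assumed).
    """
    n = len(case)
    prob = 1.0
    for i in range(n):
        # Fix the already-processed prefix x_1..x_{i-1}; initialize the
        # remaining variables randomly for the sampler.
        state = list(case[:i]) + [rng.randint(0, 1) for _ in range(n - i)]
        acc = 0.0
        for it in range(burn_in + n_iter):
            for j in range(i, n):  # one Gibbs sweep over unobserved variables
                state[j] = 1 if rng.random() < conditionals[j](state) else 0
            if it >= burn_in:      # average p(x_i = case[i] | rest) after burn-in
                p1 = conditionals[i](state)
                acc += p1 if case[i] == 1 else 1.0 - p1
        prob *= acc / n_iter
    return prob
```

The nested sampling loops make the quadratic-in-n cost visible, consistent with the timing gap reported above; a Bayesian network obtains the same quantity with a single product of local conditionals.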