next up previous
Next: Probabilistic Inference Up: Consistent Dependency Networks Previous: Consistent Dependency Networks

Definition and Basic Properties

We now describe dependency networks more formally. To do so, we begin with some notation. We denote a variable by a capitalized token (e.g., $X, X_i, \Theta$, Age), and the state or value of a corresponding variable by that same token in lower case (e.g., $x,
x_i, \theta$, age). We denote a set of variables by a bold-face capitalized token (e.g., ${\bf X}, {\bf X}_i, {\bf Pa}_i$). We use a corresponding bold-face lower-case token (e.g., ${\bf x}, {\bf x}_i, {\bf pa}_i$) to denote an assignment of state or value to each variable in a given set. We use $p(x\vert y)$ to denote the probability that $X=x$ given $Y=y$. We also use $p(x\vert y)$ to denote the probability distribution for $X$ given $Y$. Whether $p(x\vert y)$ refers to a probability or a probability distribution will be clear from context. In this paper, we shall limit our discussion to domains where all variables are discrete and finite valued and where the joint distribution is positive--that is, where all assignments of the domain variables have a non-zero probability. Although much of what we develop can be extended to more general circumstances, the extensions are tedious and we omit them. Given a domain of interest having a set of finite variables ${\bf X}=(X_1,\ldots,X_n)$ with a positive joint distribution $p({\bf x})$, a consistent dependency network for ${\bf X}$ is a pair $({\cal
G},{\cal P})$ where ${\cal G}$ is a (cyclic) directed graph and ${\cal P}$ is a set of conditional probability distributions. Each node in ${\cal G}$ corresponds to a variable in ${\bf X}$. We use $X_i$ to refer to both the variable and its corresponding node. The parents of node $X_i$, denoted ${\bf Pa}_i$, correspond to those variables ${\bf Pa}_i\subseteq (X_1,\ldots,X_{i-1},
X_{i+1},\ldots,X_n)$ that satisfy
p(x_i\vert{\bf pa}_i) = p(x_i\vert x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n) = p(x_i\vert{\bf x}
\setminus x_i).
\end{displaymath} (1)

The distributions in ${\cal P}$ are the local probability distributions $p(x_i\vert{\bf pa}_i), i=1,\ldots,n$. The dependency network is consistent in the sense that each local distribution can be obtained (via inference) from the joint distribution $p({\bf x})$. In the next section, we relax this condition. The independencies in a dependency network are precisely those of a Markov network with the same adjacencies. A Markov network for ${\bf X}$, also known as an undirected graphical model or Markov random field for ${\bf X}$, is a pair $({\cal U},\Phi)$ where ${\cal U}$ is an undirected graph and $\Phi=(\phi_1,\ldots,\phi_c)$ is a set of potential functions, one for each of the $c$ maximal cliques in ${\cal U}$, such that joint distribution has the form
p({\bf x}) = \prod_{i=1}^c \phi_i({\bf x}^i),
\end{displaymath} (2)

where ${\bf X}^i$ are the variables in clique $i$, $i=1,\ldots,c$ (e.g., see Lauritzen, 1996). The following theorem shows that consistent dependency networks and Markov networks have the same representational power.

Theorem 1: The set of positive distributions that can be encoded by a consistent dependency network with graph ${\cal G}$ is equal to the set of positive distributions that can be encoded by a Markov network whose structure has the same adjacencies as ${\cal G}$.

The two graphical representations are different in that Markov networks quantify dependencies with potential functions, whereas dependency networks use conditional probabilities. We have found the latter to be significantly easier to interpret. The proof of Theorem 1 appears in the Appendix, but it is essentially a restatement of the Hammersley-Clifford theorem (e.g., Besag, 1974). This correspondence is no coincidence. As is discussed in Besag (1974), several researchers who developed the Markov-network representation did so by initially investigating a graphical representation that fits our definition of consistent dependency network. In particular, several researchers including Lévy (1948), Bartlett (1955, Section 2.2), and Brook (1964) considered lattice systems where each variable $X_i$ depended only on its nearest neighbors ${\bf Pa}_i$, and quantified the dependencies within these systems using the conditional probability distributions $p(x_i\vert{\bf pa}_i)$. They then showed, to various levels of generality, that the only joint distributions mutually consistent with each of the conditional distributions also satisfy Equation 2. Hammersley and Clifford, in a never published manuscript, and Besag (1974) considered the more general case where each variable could have an arbitrary set of parents. They showed that, provided the joint distribution for ${\bf X}$ is positive, any graphical model specifying the independencies in Equation 1 must also satisfy Equation 2. One interesting point is that these researchers argued for the use of conditional distributions to quantify the dependencies. They considered the resulting potential form in Equation 2 to be a mathematical necessity rather than a natural expression of dependency. As we have just discussed, we share this view. The equivalence of consistent dependency networks and Markov networks suggests a straightforward approach for learning a consistent dependency network from exchangeable (i.i.d.) data. Namely, one learns the structure and potentials of a Markov network (e.g., Whittaker, 1990), and then computes (via probabilistic inference) the conditional distributions required by the dependency network. Alternatively, one can learn a related model such as a Bayesian network, decomposable model, or hierarchical log-linear model (see, e.g., Lauritzen, 1996) and convert it to a consistent dependency network. Unfortunately, the conversion process can be computationally expensive in many situations. In the next section, we extend the definition of dependency network to include inconsistent dependency networks and provide algorithms for learning such networks that are more computationally efficient than those just described. In the remainder of this section, we apply well known results about probabilistic inference to consistent dependency networks. This discussion will be useful for our further development of (general) dependency networks.

next up previous
Next: Probabilistic Inference Up: Consistent Dependency Networks Previous: Consistent Dependency Networks
Journal of Machine Learning Research, 2000-10-19