
# Appendix A.

In this appendix we prove the following theorem from Section 6.2:

**Theorem** Let $u$, $v$, $v'$ be discrete variables such that $v$ and $v'$ do not co-occur with $u$ (i.e., $u \neq 0 \Rightarrow v = v' = 0$ in a given dataset $\mathcal{D}$). Let $N_v$, $N_{v'}$ be the number of data points for which $v \neq 0$, $v' \neq 0$ respectively, and let $I_{uv}$, $I_{uv'}$ be the respective empirical mutual information values based on the sample $\mathcal{D}$. Then

$$N_v > N_{v'} \;\Longrightarrow\; I_{uv} \geq I_{uv'},$$

with equality only if $u$ is identically 0.

**Proof.** We use the notation:

$$P_{vj} \;=\; \frac{N_{vj}}{N}, \qquad P_{v0} \;=\; 1 - \sum_{j \neq 0} P_{vj},$$

where $N$ is the total number of data points and $N_{vj}$ is the number of data points for which $v = j \neq 0$; the analogous notation is used for $u$.

These values represent the (empirical) probabilities of $v$ taking value $j \neq 0$ and 0, respectively. Entropies will be denoted by $H$; for a scalar $p \in [0, 1]$, $H(p)$ stands for the entropy of a binary variable with probability $p$. We aim to show that $I_{uv}$ is a monotonically increasing function of $N_v$. We first note a "chain rule" expression for the entropy of a discrete variable. In particular, the entropy of any multivalued discrete variable $v$ can be decomposed in the following way:

$$H_v \;=\; H(P_{v0}) \;+\; (1 - P_{v0})\, H\!\left(\left\{\frac{P_{vj}}{1 - P_{v0}}\right\}_{j \neq 0}\right) \qquad (12)$$

that is, into the entropy of the binary event "$v = 0$ versus $v \neq 0$" plus the entropy of the non-zero values, weighted by their total probability.
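The decomposition (12) is easy to check numerically. The following Python sketch (the distribution `P_v` is an arbitrary illustrative example, not taken from the paper) compares the direct entropy with the two-stage decomposition:

```python
import math

def H(dist):
    """Shannon entropy (in nats) of a sequence of probabilities."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# An arbitrary distribution over {0, 1, 2, 3}; index 0 plays the role of v = 0.
P_v = [0.55, 0.25, 0.15, 0.05]
P_v0 = P_v[0]

# Chain-rule decomposition (12): entropy of "zero vs. non-zero",
# plus the entropy of the non-zero values weighted by their total mass.
nonzero = [p / (1.0 - P_v0) for p in P_v[1:]]
decomposed = H([P_v0, 1.0 - P_v0]) + (1.0 - P_v0) * H(nonzero)

print(abs(H(P_v) - decomposed) < 1e-12)  # True: the two sides agree
```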

Note moreover that the mutual information of two non-co-occurring variables is $I_{uv} = H_u - H_{u|v}$. The second term, the conditional entropy of $u$ given $v$, is

$$H_{u|v} \;=\; \sum_j P_{vj}\, H_{u|v=j} \;=\; P_{v0}\, H_{u|v=0},$$

because for every $j \neq 0$ the variable $u$ equals 0 whenever $v = j$, so that $H_{u|v=j} = 0$.
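The collapse of $H_{u|v}$ to a single term can be seen on a small synthetic dataset; in this Python sketch the pair counts are invented for illustration:

```python
import math
from collections import Counter

def H(dist):
    """Shannon entropy (in nats) of a sequence of probabilities."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Synthetic pairs (u, v) in which u and v are never both non-zero.
data = [(0, 0)] * 5 + [(1, 0)] * 3 + [(2, 0)] * 2 + [(0, 1)] * 4 + [(0, 2)] * 6
N = len(data)

# H(u|v) summed over all values of v, directly from the joint counts.
H_u_given_v = 0.0
for v in {v for _, v in data}:
    us = [u for u, vv in data if vv == v]
    cond = [c / len(us) for c in Counter(us).values()]
    H_u_given_v += (len(us) / N) * H(cond)

# Only v = 0 contributes: for v != 0 the value of u is deterministically 0.
us0 = [u for u, vv in data if vv == 0]
P_v0 = len(us0) / N
H_u_v0 = H([c / len(us0) for c in Counter(us0).values()])

print(abs(H_u_given_v - P_v0 * H_u_v0) < 1e-12)  # True: H(u|v) = P_v0 * H(u|v=0)
```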
We now expand $H_u$ and $H_{u|v=0}$ using the decomposition in (12):

$$H_u \;=\; H(1 - P_{u0}) \;+\; (1 - P_{u0})\, H\!\left(\left\{\frac{P_{uj}}{1 - P_{u0}}\right\}_{j \neq 0}\right)$$

$$H_{u|v=0} \;=\; H\!\left(\frac{1 - P_{u0}}{P_{v0}}\right) \;+\; \frac{1 - P_{u0}}{P_{v0}}\, H\!\left(\left\{\frac{P_{uj}}{1 - P_{u0}}\right\}_{j \neq 0}\right)$$
Because $u$ and $v$ are never non-zero at the same time, all non-zero values of $u$ are paired with zero values of $v$. Consequently $P(u = j \mid v = 0) = P_{uj}/P_{v0}$ for every $j \neq 0$, and the distribution of the non-zero values of $u$ conditioned on $v = 0$ coincides with their distribution under the marginal of $u$, which is why both expansions contain the same entropy $H\!\left(\left\{P_{uj}/(1 - P_{u0})\right\}_{j \neq 0}\right)$. The term $H\!\left(\frac{1 - P_{u0}}{P_{v0}}\right)$ is the entropy of a binary variable whose probability is $P(u \neq 0 \mid v = 0)$. This probability equals

$$P(u \neq 0 \mid v = 0) \;=\; \frac{1 - P_{u0}}{P_{v0}} \;=\; \frac{N_u}{N - N_v},$$

where $N_u$ denotes the number of data points with $u \neq 0$.
Note that in order to obtain a non-negative probability in the above equation one needs $P_{v0} \geq 1 - P_{u0}$, a condition that is always satisfied if $u$ and $v$ do not co-occur (the data points with $u \neq 0$ and those with $v \neq 0$ are disjoint, hence $N_u \leq N - N_v$). Replacing the previous three equations in the formula of the mutual information, the terms containing $H\!\left(\left\{P_{uj}/(1 - P_{u0})\right\}_{j \neq 0}\right)$ cancel and we get

$$I_{uv} \;=\; H(1 - P_{u0}) \;-\; P_{v0}\, H\!\left(\frac{1 - P_{u0}}{P_{v0}}\right).$$
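For binary $u$ and $v$ this closed form can be checked against the mutual information computed from the full joint distribution; in this Python sketch the counts `N`, `N_u`, `N_v` are arbitrary illustrative values:

```python
import math

def Hb(p):
    """Entropy of a binary variable with probability p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def mi_closed_form(P_u, P_v0):
    """I_uv for non-co-occurring u, v; depends only on P_u = 1 - P_u0 and P_v0."""
    return Hb(P_u) - P_v0 * Hb(P_u / P_v0)

def mi_from_joint(N_u, N_v, N):
    """Empirical mutual information of binary u, v that never co-occur."""
    joint = {(1, 0): N_u / N, (0, 1): N_v / N, (0, 0): (N - N_u - N_v) / N}
    pu = {1: N_u / N, 0: 1 - N_u / N}
    pv = {1: N_v / N, 0: 1 - N_v / N}
    return sum(p * math.log(p / (pu[a] * pv[b]))
               for (a, b), p in joint.items() if p > 0)

N, N_u, N_v = 100, 20, 30
print(abs(mi_from_joint(N_u, N_v, N) - mi_closed_form(N_u / N, 1 - N_v / N)) < 1e-12)
# True: the closed form and the direct computation agree
```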
This expression, remarkably, depends only on $P_{u0}$ and $P_{v0}$. Taking its partial derivative with respect to $P_{v0}$ yields

$$\frac{\partial I_{uv}}{\partial P_{v0}} \;=\; \log\left(1 - \frac{1 - P_{u0}}{P_{v0}}\right),$$
a value that is always negative, independently of $P_{u0}$; it vanishes only when $P_{u0} = 1$, i.e., when $u$ is identically 0, which yields the equality case of the theorem. Since $P_{v0} = 1 - N_v/N$ decreases as $N_v$ grows, this shows that the mutual information increases monotonically with the "occurrence frequency" of $v$ given by $N_v$. Note also that the above expression for the derivative is the same as the result obtained for binary variables in (11).
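Both the sign of the derivative and the resulting monotonicity in $N_v$ can be confirmed numerically; in this Python sketch the values of $P_u = 1 - P_{u0}$, $P_{v0}$, and the grid of counts are arbitrary illustrations:

```python
import math

def Hb(p):
    """Entropy of a binary variable with probability p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def I(P_u, P_v0):
    """Closed-form I_uv for non-co-occurring u, v (P_u = 1 - P_u0)."""
    return Hb(P_u) - P_v0 * Hb(P_u / P_v0)

P_u, P_v0, eps = 0.2, 0.6, 1e-6

# Central finite difference matches the analytic derivative log(1 - P_u/P_v0) < 0.
numeric = (I(P_u, P_v0 + eps) - I(P_u, P_v0 - eps)) / (2 * eps)
print(abs(numeric - math.log(1 - P_u / P_v0)) < 1e-6)  # True

# Because P_v0 = 1 - N_v/N, a negative derivative in P_v0 means I_uv grows
# with N_v: here N = 100 and N_v ranges over a grid with P_v0 >= P_u.
vals = [I(P_u, 1 - N_v / 100) for N_v in range(0, 80, 10)]
print(all(a < b for a, b in zip(vals, vals[1:])))  # True: strictly increasing
```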

Journal of Machine Learning Research 2000-10-19