

Appendix A.

In this appendix we prove the following theorem from Section 6.2.

Theorem.  Let $u,v,w$ be discrete variables such that $v, w$ do not co-occur with $u$ (i.e., $u\neq0\;\Rightarrow \;v=w=0$ in a given dataset ${\cal D}$). Let $N_{v0},N_{w0}$ be the number of data points for which $v=0, w=0$ respectively, and let $I_{uv},I_{uw}$ be the respective empirical mutual information values based on the sample ${\cal D}$. Then

\begin{displaymath}
N_{v0} \;>\; N_{w0}\;\;\Rightarrow\;\;I_{uv} \;\leq\;I_{uw}
\end{displaymath}

with equality only if $u$ is identically 0.

Proof.  We use the notation:

\begin{displaymath}
P_v(i) \;=\;\frac{N_v^i}{N},\;\;\;i \neq 0;\;\;\;
P_{v0}\;\equiv\;P_v(0)\; = \;1 - \sum_{i\neq 0}P_v(i).
\end{displaymath}

These values represent the empirical probabilities of $v$ taking the value $i\neq 0$ and the value 0, respectively; here $N_v^i$ denotes the number of data points for which $v=i$ and $N$ is the size of ${\cal D}$. Entropies will be denoted by $H$. We aim to show that $\frac{\partial I_{uv}}{\partial P_{v0}} \leq 0$, with equality only if $u$ is identically 0. We first note a ``chain rule'' expression for the entropy of a discrete variable: the entropy $H_v$ of any multivalued discrete variable $v$ can be decomposed as
\begin{eqnarray*}
H_v & = & \underbrace{-P_{v0}\log P_{v0} - (1-P_{v0})\log (1-P_{v0})}_{H_{v0}}
\;-\;(1-P_{v0})\underbrace{\sum_{i\neq 0} \frac{P_v(i)}{1-P_{v0}} \log \frac{P_v(i)}{1-P_{v0}}}_{-H_{\overline{v}}} \\
& = & H_{v0}+(1-P_{v0})H_{\overline{v}}. \hspace*{2cm} (12)
\end{eqnarray*}
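As an informal sanity check of the decomposition (12) (not part of the original proof), the identity can be verified numerically for an arbitrary discrete distribution. The short Python sketch below uses natural logarithms and illustrative values; the function name and the example distribution are chosen for this sketch only.

\begin{verbatim}
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, using the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Example distribution over the values 0, 1, 2, 3 of v (values chosen arbitrarily).
P_v = np.array([0.4, 0.3, 0.2, 0.1])
P_v0 = P_v[0]

H_v    = entropy(P_v)
H_v0   = entropy([P_v0, 1.0 - P_v0])        # binary entropy of the event v = 0
H_vbar = entropy(P_v[1:] / (1.0 - P_v0))    # entropy of v conditioned on v != 0

# Decomposition (12): H_v = H_v0 + (1 - P_v0) * H_vbar
assert np.isclose(H_v, H_v0 + (1.0 - P_v0) * H_vbar)
\end{verbatim}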

Note now that the mutual information of $u$ and $v$ can be written as $I_{uv}\;=\;H_u-H_{u\vert v}$. The second term, the conditional entropy of $u$ given $v$, is

\begin{displaymath}
H_{u\vert v}\;=\;P_{v0}H_{u\vert v=0} +\sum_{j\neq 0} P_v(j)
\underbrace{H_{u\vert v=j}}_{0}.
\end{displaymath}

The conditional entropies $H_{u\vert v=j}$ vanish for $j\neq 0$ because $v\neq 0$ implies $u=0$, so that $u$ is constant given $v=j$. We now expand $H_{u\vert v=0}$ using the decomposition in (12):

\begin{displaymath}
H_{u\vert v=0}\;=\;H_{u0\vert v=0}+(1-P_{u=0\vert v=0})H_{u\vert u\neq 0,v=0}.
\end{displaymath}

Because $u$ and $v$ are never nonzero at the same time, all nonzero values of $u$ are paired with zero values of $v$. Consequently $Pr[u=i\vert u\not=0,v=0]=Pr[u=i\vert u\neq0]$ and $H_{u\vert u\neq 0,v=0}\;=\;H_{\overline{u}}$. The term $H_{u0\vert v=0}$ is the binary entropy associated with the probability $Pr[u=0\vert v=0]$. This probability equals

\begin{displaymath}
Pr[u=0\vert v=0]\;=\;1-\frac{1-P_{u0}}{P_{v0}}.
\end{displaymath}

Note that in order to obtain a non-negative probability in the above equation one needs $1-P_{u0}\;\leq\;P_{v0}$, a condition that is always satisfied when $u$ and $v$ do not co-occur. Substituting the previous three equations into the expression for the mutual information, we obtain

\begin{displaymath}
I_{uv}\;=\;-P_{u0}\log P_{u0}-P_{v0}\log P_{v0} + (P_{u0}+P_{v0}-1) \log (P_{u0}+P_{v0}-1).
\end{displaymath}
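As a hedged numerical illustration (again, not part of the original proof), this closed form can be checked against a direct computation of the mutual information for a joint distribution in which $u$ and $v$ never co-occur. The joint probabilities below are arbitrary, and the function name is chosen for this sketch only.

\begin{verbatim}
import numpy as np

def mutual_information(P):
    """Mutual information of a joint distribution P[i, j], with 0*log(0) = 0."""
    Pu = P.sum(axis=1, keepdims=True)   # marginal of u (rows)
    Pv = P.sum(axis=0, keepdims=True)   # marginal of v (columns)
    nz = P > 0
    return np.sum(P[nz] * np.log(P[nz] / (Pu @ Pv)[nz]))

# Joint distribution of (u, v), rows indexed by u, columns by v.
# u and v never co-occur: all mass with u != 0 lies in the column v = 0.
P = np.array([[0.20, 0.15, 0.15],   # u = 0
              [0.30, 0.00, 0.00],   # u = 1
              [0.20, 0.00, 0.00]])  # u = 2

P_u0 = P[0, :].sum()   # Pr[u = 0] = 0.5
P_v0 = P[:, 0].sum()   # Pr[v = 0] = 0.7

closed_form = (-P_u0 * np.log(P_u0) - P_v0 * np.log(P_v0)
               + (P_u0 + P_v0 - 1) * np.log(P_u0 + P_v0 - 1))

assert np.isclose(mutual_information(P), closed_form)
\end{verbatim}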

Remarkably, this expression for $I_{uv}$ depends only on $P_{u0}$ and $P_{v0}$. Taking its partial derivative with respect to $P_{v0}$ yields

\begin{displaymath}
\frac{\partial I_{uv}}{\partial P_{v0}}\;=\;\log\frac{P_{v0}+P_{u0}-1}{P_{v0}}\;\leq\;0,
\end{displaymath}

a quantity that is nonpositive for every admissible value of $P_{v0}$, and strictly negative unless $P_{u0}=1$, that is, unless $u$ is identically 0. Thus $I_{uv}$ decreases monotonically in $P_{v0}$, or equivalently increases monotonically with the ``occurrence frequency'' of $v$ given by $1-P_{v0}$. Since $N_{v0} > N_{w0}$ implies $P_{v0} > P_{w0}$, it follows that $I_{uv}\leq I_{uw}$, with equality only if $u$ is identically 0, which proves the theorem. Note also that the above expression for the derivative is the same as the result obtained for binary variables in (11).
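As a final illustrative sketch (not in the original text), the monotonicity can also be observed numerically by fixing $P_{u0}$ and evaluating the closed-form $I_{uv}$ over a grid of admissible values $P_{v0}\geq 1-P_{u0}$; the helper name, the choice $P_{u0}=0.6$, and the grid size are arbitrary.

\begin{verbatim}
import numpy as np

def I_uv(P_u0, P_v0):
    """Closed-form mutual information of non-co-occurring u, v; 0*log(0) = 0."""
    third = ((P_u0 + P_v0 - 1) * np.log(P_u0 + P_v0 - 1)
             if P_u0 + P_v0 > 1 else 0.0)
    return -P_u0 * np.log(P_u0) - P_v0 * np.log(P_v0) + third

P_u0 = 0.6                                  # arbitrary fixed occurrence pattern for u
grid = np.linspace(1 - P_u0, 1.0, 50)       # admissible values of P_v0
values = [I_uv(P_u0, p) for p in grid]

# I_uv decreases as P_v0 grows, i.e., increases with the occurrence frequency 1 - P_v0.
assert np.all(np.diff(values) <= 1e-12)
\end{verbatim}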