

Appendix A.

In this appendix we prove the following theorem from Section 6.2.

Theorem.  Let $u,v,w$ be discrete variables such that $v, w$ do not co-occur with $u$ (i.e., $u\neq0\;\Rightarrow \;v=w=0$ in a given dataset ${\cal D}$). Let $N_{v0},N_{w0}$ be the number of data points for which $v=0, w=0$ respectively, and let $I_{uv},I_{uw}$ be the respective empirical mutual information values based on the sample ${\cal D}$. Then

\begin{displaymath}
N_{v0} \;>\; N_{w0}\;\;\Rightarrow\;\;I_{uv} \;\leq\;I_{uw}
\end{displaymath}

with equality only if $u$ is identically 0.

Proof.  We use the notation:

\begin{displaymath}
P_v(i) \;=\;\frac{N_v^i}{N},\;\;\;i \neq 0;\;\;\;
P_{v0}\;\equiv\;P_v(0)\; = \;1 - \sum_{i\neq 0}P_v(i).
\end{displaymath}

These values represent the empirical probabilities of $v$ taking the value $i\neq 0$ and the value 0, respectively; here $N_v^i$ denotes the number of data points for which $v=i$ and $N$ is the size of ${\cal D}$. Entropies will be denoted by $H$. We aim to show that $\frac{\partial I_{uv}}{\partial P_{v0}} \leq 0$, with equality only if $u$ is identically 0. We first note a ``chain rule'' expression for the entropy of a discrete variable: the entropy $H_v$ of any multivalued discrete variable $v$ can be decomposed as
\begin{eqnarray*}
H_v & = & \underbrace{-P_{v0}\log P_{v0} - (1-P_{v0})\log (1-P_{v0})}_{H_{v0}}
\;-\;(1-P_{v0})\underbrace{\sum_{i\neq 0} \frac{P_v(i)}{1-P_{v0}} \log \frac{P_v(i)}{1-P_{v0}}}_{-H_{\overline{v}}} \\
& = & H_{v0}+(1-P_{v0})H_{\overline{v}}. \hspace*{2cm} (12)
\end{eqnarray*}
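As an informal sanity check of the decomposition (12) (not part of the original proof), the identity can be verified numerically for an arbitrary discrete distribution. The short Python sketch below uses natural logarithms and illustrative values; the function name and the example distribution are chosen for this sketch only.

\begin{verbatim}
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, using the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Example distribution over the values 0, 1, 2, 3 of v (values chosen arbitrarily).
P_v = np.array([0.4, 0.3, 0.2, 0.1])
P_v0 = P_v[0]

H_v    = entropy(P_v)
H_v0   = entropy([P_v0, 1.0 - P_v0])        # binary entropy of the event v = 0
H_vbar = entropy(P_v[1:] / (1.0 - P_v0))    # entropy of v conditioned on v != 0

# Decomposition (12): H_v = H_v0 + (1 - P_v0) * H_vbar
assert np.isclose(H_v, H_v0 + (1.0 - P_v0) * H_vbar)
\end{verbatim}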

Note now that the mutual information of $u$ and $v$ can be written as $I_{uv}\;=\;H_u-H_{u\vert v}$. The second term, the conditional entropy of $u$ given $v$, is

\begin{displaymath}
H_{u\vert v}\;=\;P_{v0}H_{u\vert v=0} +\sum_{j\neq 0} P_v(j)
\underbrace{H_{u\vert v=j}}_{0}.
\end{displaymath}

The conditional entropies $H_{u\vert v=j}$ vanish for $j\neq 0$ because $v\neq 0$ implies $u=0$, so that $u$ is constant given $v=j$. We now expand $H_{u\vert v=0}$ using the decomposition in (12):

\begin{displaymath}
H_{u\vert v=0}\;=\;H_{u0\vert v=0}+(1-P_{u=0\vert v=0})H_{u\vert u\neq 0,v=0}.
\end{displaymath}

Because $u$ and $v$ are never nonzero at the same time, all nonzero values of $u$ are paired with zero values of $v$. Consequently $Pr[u=i\vert u\not=0,v=0]=Pr[u=i\vert u\neq0]$ and $H_{u\vert u\neq 0,v=0}\;=\;H_{\overline{u}}$. The term $H_{u0\vert v=0}$ is the binary entropy associated with the probability $Pr[u=0\vert v=0]$. This probability equals

\begin{displaymath}
Pr[u=0\vert v=0]\;=\;1-\frac{1-P_{u0}}{P_{v0}}.
\end{displaymath}

Note that in order to obtain a non-negative probability in the above equation one needs $1-P_{u0}\;\leq\;P_{v0}$, a condition that is always satisfied when $u$ and $v$ do not co-occur. Substituting the previous three equations into the expression for the mutual information, we obtain

\begin{displaymath}
I_{uv}\;=\;-P_{u0}\log P_{u0}-P_{v0}\log P_{v0} + (P_{u0}+P_{v0}-1) \log (P_{u0}+P_{v0}-1).
\end{displaymath}
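As a hedged numerical illustration (again, not part of the original proof), this closed form can be checked against a direct computation of the mutual information for a joint distribution in which $u$ and $v$ never co-occur. The joint probabilities below are arbitrary, and the function name is chosen for this sketch only.

\begin{verbatim}
import numpy as np

def mutual_information(P):
    """Mutual information of a joint distribution P[i, j], with 0*log(0) = 0."""
    Pu = P.sum(axis=1, keepdims=True)   # marginal of u (rows)
    Pv = P.sum(axis=0, keepdims=True)   # marginal of v (columns)
    nz = P > 0
    return np.sum(P[nz] * np.log(P[nz] / (Pu @ Pv)[nz]))

# Joint distribution of (u, v), rows indexed by u, columns by v.
# u and v never co-occur: all mass with u != 0 lies in the column v = 0.
P = np.array([[0.20, 0.15, 0.15],   # u = 0
              [0.30, 0.00, 0.00],   # u = 1
              [0.20, 0.00, 0.00]])  # u = 2

P_u0 = P[0, :].sum()   # Pr[u = 0] = 0.5
P_v0 = P[:, 0].sum()   # Pr[v = 0] = 0.7

closed_form = (-P_u0 * np.log(P_u0) - P_v0 * np.log(P_v0)
               + (P_u0 + P_v0 - 1) * np.log(P_u0 + P_v0 - 1))

assert np.isclose(mutual_information(P), closed_form)
\end{verbatim}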

Remarkably, this expression for $I_{uv}$ depends only on $P_{u0}$ and $P_{v0}$. Taking its partial derivative with respect to $P_{v0}$ yields

\begin{displaymath}
\frac{\partial I_{uv}}{\partial P_{v0}}\;=\;\log\frac{P_{v0}+P_{u0}-1}{P_{v0}}\;\leq\;0,
\end{displaymath}

a quantity that is nonpositive for every admissible value of $P_{v0}$, and strictly negative unless $P_{u0}=1$, that is, unless $u$ is identically 0. Thus $I_{uv}$ decreases monotonically in $P_{v0}$, or equivalently increases monotonically with the ``occurrence frequency'' of $v$ given by $1-P_{v0}$. Since $N_{v0} > N_{w0}$ implies $P_{v0} > P_{w0}$, it follows that $I_{uv}\leq I_{uw}$, with equality only if $u$ is identically 0, which proves the theorem. Note also that the above expression for the derivative is the same as the result obtained for binary variables in (11).
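As a final illustrative sketch (not in the original text), the monotonicity can also be observed numerically by fixing $P_{u0}$ and evaluating the closed-form $I_{uv}$ over a grid of admissible values $P_{v0}\geq 1-P_{u0}$; the helper name, the choice $P_{u0}=0.6$, and the grid size are arbitrary.

\begin{verbatim}
import numpy as np

def I_uv(P_u0, P_v0):
    """Closed-form mutual information of non-co-occurring u, v; 0*log(0) = 0."""
    third = ((P_u0 + P_v0 - 1) * np.log(P_u0 + P_v0 - 1)
             if P_u0 + P_v0 > 1 else 0.0)
    return -P_u0 * np.log(P_u0) - P_v0 * np.log(P_v0) + third

P_u0 = 0.6                                  # arbitrary fixed occurrence pattern for u
grid = np.linspace(1 - P_u0, 1.0, 50)       # admissible values of P_v0
values = [I_uv(P_u0, p) for p in grid]

# I_uv decreases as P_v0 grows, i.e., increases with the occurrence frequency 1 - P_v0.
assert np.all(np.diff(values) <= 1e-12)
\end{verbatim}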