Cross-entropy and mutual information 

If we have a target probability distribution p(x), which is approximated by another distribution q(x), a useful measure is the cross-entropy between p and q (we are using the discrete definition, as our problems must be solved through numerical computations):

H(P, Q) = -Σ_x p(x) log q(x)
If the logarithm base is 2, cross-entropy measures the number of bits required to decode an event drawn from P when using a code optimized for Q. In many machine learning problems, we have a source distribution and we need to train an estimator to correctly identify the class of a sample. If the error is null, P = Q and the cross-entropy reaches its minimum (corresponding to the entropy H(P)). However, as a null error is almost impossible when working with Q, we need to pay a price of H(P, Q) bits to determine the right class starting from a prediction. Our goal is therefore to minimize the cross-entropy, keeping this price below a threshold beyond which the predicted output would change. In other words, consider a binary output and a sigmoid function: we have a threshold of 0.5 (this is the maximum price we can pay) to identify the correct class using a step function (0.6 -> 1, 0.1 -> 0, 0.4999 -> 0, and so on). As our classifier doesn't know the original distribution, it's necessary to push the cross-entropy below a tolerable noise-robustness threshold (which is always the smallest achievable one).
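As a minimal sketch of the definition above (using NumPy and hypothetical distributions chosen only for illustration), the discrete cross-entropy in bits can be computed directly:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Discrete cross-entropy H(P, Q) in bits (base-2 logarithm)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip q to avoid log(0) when the approximation assigns zero probability
    return -np.sum(p * np.log2(np.clip(q, eps, 1.0)))

# Hypothetical target distribution and two approximations
p = [0.5, 0.25, 0.25]
print(cross_entropy(p, p))                 # equals the entropy H(P) = 1.5 bits
print(cross_entropy(p, [0.4, 0.3, 0.3]))  # worse approximation, higher cost
```

Note that H(P, Q) = H(P) only when Q = P; any mismatch increases the number of bits we have to pay.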

In order to understand how a machine learning approach is performing, it's also useful to introduce the conditional entropy, that is, the uncertainty about X given the knowledge of Y:

H(X|Y) = -Σ_x Σ_y p(x, y) log p(x|y)
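The conditional entropy can be sketched numerically from a joint distribution (a hypothetical p(x, y) chosen only for illustration, using NumPy):

```python
import numpy as np

def conditional_entropy(pxy, eps=1e-12):
    """H(X|Y) = -sum_{x,y} p(x, y) log2 p(x|y), in bits."""
    pxy = np.asarray(pxy, dtype=float)
    py = pxy.sum(axis=0)                        # marginal p(y)
    px_given_y = pxy / np.clip(py, eps, None)   # p(x|y), column-wise
    mask = pxy > eps                            # skip zero-probability terms
    return -np.sum(pxy[mask] * np.log2(px_given_y[mask]))

# Hypothetical joint distribution: rows index x, columns index y
pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
print(conditional_entropy(pxy))  # less than H(X) = 1 bit: Y reduces uncertainty
```

Since conditioning never increases uncertainty, H(X|Y) is always bounded by the marginal entropy H(X).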
Through this concept, it's possible to introduce the idea of mutual information, which is the amount of information shared by the two variables and, therefore, the reduction of uncertainty about X provided by the knowledge of Y:

I(X; Y) = H(X) - H(X|Y)
Intuitively, when X and Y are independent, they don't share any information and the mutual information is null. However, in machine learning tasks, there's a very tight dependence between an original feature and its prediction, so we want to maximize the information shared by both distributions. If the conditional entropy is small enough (so Y is able to describe X quite well), the mutual information gets close to the marginal entropy H(X), which measures the amount of information we want to learn.
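Both limiting cases can be checked numerically. The sketch below (using NumPy, with hypothetical joint distributions) computes I(X; Y) = H(X) - H(X|Y) via the identity H(X|Y) = H(X, Y) - H(Y):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in bits of a probability array."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                  # skip zero-probability outcomes
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X; Y) = H(X) - H(X|Y) from a joint distribution p(x, y)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)            # marginal p(x)
    py = pxy.sum(axis=0)            # marginal p(y)
    h_x_given_y = entropy(pxy.ravel()) - entropy(py)  # H(X,Y) - H(Y)
    return entropy(px) - h_x_given_y

# Independent variables: p(x, y) = p(x) p(y), so I(X; Y) = 0
px = np.array([0.5, 0.5])
py = np.array([0.25, 0.75])
print(mutual_information(np.outer(px, py)))   # ~0.0

# Perfect dependence (Y determines X): I(X; Y) = H(X) = 1 bit
print(mutual_information(np.array([[0.5, 0.0],
                                   [0.0, 0.5]])))  # ~1.0
```

In the second case the mutual information equals the full marginal entropy H(X): knowing Y removes all uncertainty about X.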

An interesting learning approach based on information theory, called minimum description length (MDL), is discussed in The Minimum Description Length Principle in Coding and Modeling, Barron A., Rissanen J., Yu B., IEEE Transactions on Information Theory, Vol. 44/6, 10/1998.