- Machine Learning Algorithms (Second Edition)
- Giuseppe Bonaccorso
Entropy
The most useful measure in information theory (as well as in machine learning) is called entropy:

H(X) = -\sum_{x} p(x) \log_2 p(x)
This value is proportional to the uncertainty of X and is measured in bits (if the logarithm has a different base, the unit changes accordingly). For many purposes, a high entropy is preferable, because it means that a certain feature contains more information. For example, when tossing a fair coin (two possible outcomes), H(X) = 1 bit, but if the number of equally probable outcomes grows, H(X) grows as well, because the larger number of different values implies greater variability. It's possible to prove that for a Gaussian distribution (using the natural logarithm):

H(X) = \frac{1}{2} \ln(2 \pi e \sigma^2)
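A minimal NumPy sketch of both formulas (the helper function and the sample values are illustrative assumptions, not code from the book):

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (zero-probability terms are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# A fair coin carries 1 bit of entropy
print(entropy([0.5, 0.5]))            # 1.0

# A uniform distribution over 8 outcomes carries 3 bits
print(entropy(np.ones(8) / 8.0))      # 3.0

# Differential entropy of a Gaussian, in nats: 0.5 * ln(2*pi*e*sigma^2)
sigma = 2.0
print(0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2))   # about 2.11 nats
```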
So, the entropy grows with the variance, which is a measure of the amount of information carried by a single feature. In the next chapter, Chapter 3, Feature Selection and Feature Engineering, we're going to discuss a method for feature selection based on a variance threshold. Gaussian distributions are very common, so this example can be considered as a general approach to feature filtering: a low variance implies a low information level, and a model could often discard such features.
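As a small preview of that approach, the following scikit-learn sketch uses VarianceThreshold to drop a low-variance feature (the toy dataset and the threshold value are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy dataset: three high-variance features plus one almost-constant one
rng = np.random.RandomState(1000)
X = np.hstack([
    rng.normal(0.0, 9.0, size=(100, 3)),    # STD around 9: informative
    rng.normal(0.0, 0.1, size=(100, 1)),    # STD around 0.1: little information
])

# Drop every feature whose variance falls below the threshold
selector = VarianceThreshold(threshold=1.5)
X_filtered = selector.fit_transform(X)

print(X.shape, '->', X_filtered.shape)      # (100, 4) -> (100, 3)
```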
In the following graph, there's a plot of H(X) for a Gaussian distribution, expressed in nats (the corresponding unit of measure when natural logarithms are used):
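A short matplotlib sketch that generates an equivalent plot (the range of standard deviations is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

# Differential entropy of a Gaussian (in nats) as a function of the standard deviation
sigma = np.linspace(0.05, 10.0, 500)
H = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

plt.figure(figsize=(8, 4))
plt.plot(sigma, H)
plt.xlabel('Standard deviation')
plt.ylabel('H(X) (nats)')
plt.title('Entropy of a Gaussian distribution')
plt.grid(True)
plt.show()
```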
For example, if a dataset is made up of some features whose standard deviation (more convenient to reason about here than the variance) is bounded between 8 and 10, and a few others with an STD < 1.5, the latter could be discarded with a limited loss in terms of information. These concepts are very important in real-life problems, when large datasets must be cleaned and processed in an efficient way. Another important measure associated with the entropy is called perplexity:

PP(X) = 2^{H(X)}
If the entropy is computed using log2, this measure is immediately associable with the total number of equally probable outcomes that would produce the same uncertainty. Whenever the natural logarithm is used, the same quantity is obtained as e^{H(X)}, since perplexity does not depend on the base of the logarithm. Perplexity is very useful for assessing the amount of uncertainty in a distribution (it can be either a data distribution or the output of a model). For example, if we have a fair coin (p(head) = p(tail) = 0.5), the entropy is equal to the following:

H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}
The perplexity is equal to 2, meaning that we expect 2 equally probable outcomes. As the total number of classes is also 2, this situation is unacceptable, because we have no way to decide the right outcome (which is the entire goal of the experiment). On the other hand, if we have performed a previous analysis (imagine that the model has done it) and p(head) = 0.8, the entropy becomes the following:

H(X) = -(0.8 \log_2 0.8 + 0.2 \log_2 0.2) \approx 0.72 \text{ bits}
The perplexity drops to about 1.65 (2^0.72) because, now, one outcome is clearly more likely than the other one.
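A quick numerical check of the two coin examples (the helper functions are an illustrative sketch):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def perplexity(p):
    # 2**H with H in bits; the resulting value does not depend on the logarithm base
    return 2.0 ** entropy_bits(p)

print(entropy_bits([0.5, 0.5]), perplexity([0.5, 0.5]))   # 1.0 bit, perplexity 2.0
print(entropy_bits([0.8, 0.2]), perplexity([0.8, 0.2]))   # about 0.72 bits, perplexity about 1.65
```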