Introduction to statistical learning concepts

Imagine that you need to design a spam-filtering algorithm, starting from this initial (oversimplified) classification based on two parameters:

p1: The message contains more than five blacklisted words
p2: The message is less than 20 characters in length

We have collected 200 email messages (X) (for simplicity, we consider p1 and p2 to be mutually exclusive), and we need to find two probabilistic hypotheses (expressed in terms of p1 and p2) to determine the following:

P(Spam|hp1, hp2)
We also assume the conditional independence of both terms (that is, hp1 and hp2 contribute jointly to the spam probability in the same way as they would individually).
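
Under this assumption, the joint likelihood factorizes into the product of the individual likelihoods. As a minimal sketch of what the assumption states, using the same notation:

P(hp1, hp2|Spam) = P(hp1|Spam) · P(hp2|Spam)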

For example, we could think about rules (hypotheses) such as "if there are more than five blacklisted words, or if the message is less than 20 characters in length, then the probability of spam is high" (for example, greater than 50%). However, without assigning probabilities, it's difficult to generalize when the dataset changes (as it does in a real-world antispam filter). We also want to determine a partitioning threshold (such as green, yellow, and red signals) to help the user decide what to keep and what to trash.
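
To make the idea of empirical probabilities and traffic-light thresholds concrete, here is a minimal Python sketch. The per-parameter counts are invented for illustration (the actual breakdown of the 200 messages is not given here), and the threshold values are arbitrary assumptions:

```python
# Hypothetical counts from the 200 collected messages (invented for illustration)
counts = {
    'p1': {'spam': 45, 'ham': 5},    # more than five blacklisted words
    'p2': {'spam': 30, 'ham': 20},   # shorter than 20 characters
}

def empirical_spam_probability(parameter):
    """Relative frequency of spam among messages satisfying the given parameter."""
    c = counts[parameter]
    return c['spam'] / (c['spam'] + c['ham'])

def signal(p, low=0.3, high=0.7):
    """Map a spam probability to a traffic-light signal (thresholds are arbitrary)."""
    if p < low:
        return 'green'
    if p < high:
        return 'yellow'
    return 'red'

for parameter in ('p1', 'p2'):
    p = empirical_spam_probability(parameter)
    print(f"P(Spam|{parameter}) = {p:.2f} -> {signal(p)}")
```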

As the hypotheses are determined through the dataset X, we can also write (in a discrete form) the following:

P(Spam|hp1, hp2, X)
In this example, it's quite easy to determine the value of each term. However, in general, it's necessary to introduce the Bayes formula (which will be discussed in Chapter 6, Naive Bayes and Discriminant Analysis):

P(h|X) ∝ P(X|h) · P(h)
The proportionality is necessary to avoid the introduction of the marginal probability P(X), which only acts as a normalization factor (remember that, for a discrete random variable, the probabilities of all possible outcomes must sum to 1).
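
The effect of the normalization factor can be shown with a short numerical sketch. All the prior and likelihood values below are assumptions chosen only for illustration:

```python
# Bayes' formula in proportional form: posterior ∝ likelihood × prior.
# The numbers are illustrative assumptions, not values taken from the dataset X.
prior = {'spam': 0.4, 'ham': 0.6}           # P(h): Apriori class probabilities
likelihood = {'spam': 0.75, 'ham': 0.10}    # P(X|h): how well each hypothesis explains X

# Unnormalized posteriors (proportional to the true ones)
unnormalized = {h: likelihood[h] * prior[h] for h in prior}

# P(X) is the sum of the unnormalized terms; dividing by it makes the
# posteriors sum to 1 without changing their relative proportions.
evidence = sum(unnormalized.values())
posterior = {h: v / evidence for h, v in unnormalized.items()}

print(posterior)  # {'spam': 0.833..., 'ham': 0.166...}
```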

In the previous equation, the first term is called the A Posteriori (which comes after) probability, because it's determined by a marginal Apriori (which comes first) probability multiplied by a factor called the likelihood. To understand the philosophy of this approach, it's useful to consider a simple example: tossing a fair coin. Everybody knows that the marginal probability of each face is equal to 0.5, but who decided that? It's a theoretical consequence of logic and the axioms of probability (a good physicist would say that it's never exactly 0.5, because of several factors that we simply discard). After tossing the coin 100 times, we observe the outcomes and, surprisingly, we discover that the frequency of heads is slightly different from the expected one (for example, 0.46). How can we correct our estimation? The term called the likelihood measures how much our actual experiments confirm the Apriori hypothesis and determines another probability (the A Posteriori one), which reflects the actual situation. The likelihood, therefore, helps us correct our estimation dynamically, overcoming the problem of a fixed probability.
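
A small numerical sketch can make the coin example concrete. The discrete grid of hypotheses for the head probability and the Apriori values peaked at 0.5 are assumptions made only for illustration; the data (46 heads out of 100 tosses) matches the observed frequency mentioned above:

```python
from math import comb

# Discrete grid of hypotheses for the probability of heads
thetas = [0.40, 0.45, 0.50, 0.55, 0.60]

# Apriori belief: strongly peaked on the fair coin (illustrative values)
prior = {0.40: 0.05, 0.45: 0.15, 0.50: 0.60, 0.55: 0.15, 0.60: 0.05}

# Observed data: 46 heads out of 100 tosses
heads, n = 46, 100

# Binomial likelihood of the data under each hypothesis
likelihood = {t: comb(n, heads) * t**heads * (1 - t)**(n - heads) for t in thetas}

# A Posteriori probabilities: likelihood × prior, then normalized
unnormalized = {t: likelihood[t] * prior[t] for t in thetas}
evidence = sum(unnormalized.values())
posterior = {t: v / evidence for t, v in unnormalized.items()}

for t in thetas:
    print(f"theta={t:.2f}  prior={prior[t]:.2f}  posterior={posterior[t]:.2f}")

# The posterior mean is pulled below 0.5, toward the observed frequency of 0.46
posterior_mean = sum(t * posterior[t] for t in thetas)
print(f"posterior mean = {posterior_mean:.3f}")
```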

In Chapter 6, Naive Bayes and Discriminant Analysis, which is dedicated to naive Bayes algorithms, we're going to discuss these topics in depth and implement a few examples with scikit-learn; however, it's useful to introduce here two statistical learning approaches that are widely used.