Supervised learning

A supervised scenario is characterized by the concept of a teacher or supervisor, whose main task is to provide the agent with a precise measure of its error (directly comparable with output values). With actual algorithms, this function is provided by a training set made up of couples (input and expected output). Starting from this information, the agent can correct its parameters so as to reduce the magnitude of a global loss function. After each iteration, if the algorithm is flexible enough and data elements are coherent, the overall accuracy increases and the difference between the predicted and expected values becomes close to zero. Of course, in a supervised scenario, the goal is training a system that must also work with samples that have never been seen before. So, it's necessary to allow the model to develop a generalization ability and avoid a common problem called overfitting, which causes overlearning due to an excessive capacity (we're going to discuss this in more detail in the following chapters, however, we can say that one of the main effects of such a problem is the ability to only correctly predict the samples used for training, while the error rate for the remaining ones is always very high).

In the following graph, a few training points are marked with circles, and the thin blue line represents a perfect generalization (in this case, the connection is a simple segment):

Example of regression of a stock price with different interpolating curves

Two different models are trained with the same datasets (corresponding to the two larger lines). The former is unacceptable because it cannot generalize and capture the fastest dynamics (in terms of frequency), while the latter seems a very good compromise between the original trend, and has a residual ability to generalize correctly in a predictive analysis.

Formally, the previous example is called regression because it's based on continuous output values. Instead, if there is only a discrete number of possible outcomes (called categories), the process becomes a classification. Sometimes, instead of predicting the actual category, it's better to determine its probability distribution. For example, an algorithm can be trained to recognize a handwritten alphabetical letter, so its output is categorical (in English, there'll be 26 allowed symbols). On the other hand, even for human beings, such a process can lead to more than one probable outcome when the visual representation of a letter isn't clear enough to belong to a single category. This means that the actual output is better described by a discrete probability distribution (for example, with 26 continuous values normalized so that they always sum up to 1). 

In the following graph, there's an example of classification of elements with two features. The majority of algorithms try to find the best separating hyperplane (in this case, it's a linear problem) by imposing different conditions. However, the goal is always the same: reducing the number of misclassifications and increasing the noise-robustness. For example, look at the triangular point that is closest to the plane (its coordinates are about [5.1 - 3.0]). If the magnitude of the second feature were affected by noise and so the value were quite smaller than 3.0, a slightly higher hyperplane could wrongly classify it. We're going to discuss some powerful techniques to solve these problems in later chapters:

Example of linear classification

Common supervised learning applications include the following:

  • Predictive analysis based on regression or categorical classification
  • Spam detection
  • Pattern detection
  • NLP
  • Sentiment analysis
  • Automatic image classification
  • Automatic sequence processing (for example, music or speech)