Supervised learning in practice with Python

As we said earlier, supervised learning algorithms learn to approximate the function that maps inputs to outputs, creating a model that can predict outputs for unseen inputs.

It's conventional to denote inputs as x and outputs as y; both can be numerical or categorical.
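As a toy illustration (NumPy is assumed here purely to build the arrays), x could be a matrix of numerical measurements and y a vector of categorical labels:

```python
import numpy as np

# Inputs x: one numerical feature per row (four observations)
x = np.array([[5.1], [7.0], [6.3], [4.9]])

# Outputs y: a categorical label for each observation
y = np.array(["small", "large", "large", "small"])
```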

We can distinguish between two types of supervised learning:

  • Classification
  • Regression

Classification is a task where the output variable can assume a finite number of values, called categories. An example of classification would be classifying different types of flowers (output) given the sepal length (input). Classification can be further divided into subtypes (a sketch follows the list):

  • Binary classification: The task of predicting which of two classes an instance belongs to
  • Multiclass classification: The task (also known as multinomial classification) of predicting the most probable label (class) for each instance from three or more possibilities
  • Multilabel classification: The task where multiple labels can be assigned to each input
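Below is a minimal sketch of multiclass classification, matching the flower example above. The use of scikit-learn, its bundled iris dataset, and a k-nearest-neighbors classifier are assumptions made for illustration, not something prescribed by the text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset; keep only sepal length (column 0) as the input
X, y = load_iris(return_X_y=True)
X = X[:, [0]]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multiclass classification: predict the most probable species for each flower
clf = KNeighborsClassifier().fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # predicted class labels (0, 1, or 2)
print(clf.predict_proba(X_test[:5]))  # estimated probability of each class
```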

Regression is a task where the output variable is continuous. Here are two common regression algorithms, sketched in code after the list:

  • Linear regression: This finds linear relationships between inputs and outputs
  • Logistic regression: This finds the probability of a binary output (despite its name, it is typically used for classification)
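As a rough sketch of both algorithms (scikit-learn and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: continuous output, roughly y = 2x + 1 with some noise
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_continuous = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
lin = LinearRegression().fit(X, y_continuous)
print(lin.coef_, lin.intercept_)  # slope and intercept, close to 2 and 1

# Logistic regression: probability of a binary output (0 or 1)
y_binary = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, y_binary)
print(log.predict_proba([[2.5]]))  # [P(class 0), P(class 1)] for x = 2.5
```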

In general, the supervised learning problem is solved in a standard way by performing the following steps (an end-to-end sketch follows the list):

  1. Performing data cleaning to make sure the data we are using is as accurate and descriptive as possible.
  2. Executing the feature engineering process, which involves creating new features from the existing ones to improve the algorithm's performance.
  3. Transforming the input data into something our algorithm can understand, which is known as data transformation. Some algorithms, such as neural networks, don't work well with unscaled data, as they would naturally give more importance to inputs with a larger magnitude.
  4. Choosing an appropriate model (or a few of them) for the problem.
  5. Choosing an appropriate metric to measure the effectiveness of our algorithm.
  6. Training the model using a subset of the available data, called the training set. The data transformations are also calibrated on this training set.
  7. Testing the model on the remaining held-out data, called the test set.
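Putting the steps together, here is a minimal end-to-end sketch. The choice of scikit-learn, its bundled breast-cancer dataset, logistic regression as the model, and accuracy as the metric are all illustrative assumptions. Note that, per step 6, the scaler is fitted on the training set only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a ready-made binary classification dataset (data cleaning already done)
X, y = load_breast_cancer(return_X_y=True)

# Step 6: hold out a test set; train on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: data transformation, calibrated (fitted) on the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Steps 4 and 6: choose a model and train it
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# Steps 5 and 7: evaluate with the chosen metric on the test set
print("accuracy:", accuracy_score(y_test, model.predict(X_test_s)))
```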