Data formats

In both supervised and unsupervised learning problems, there will always be a dataset, defined as a finite set of real vectors with m features each:

X = {x1, x2, …, xN}, where xi ∈ ℝ^m
Considering that our approach is always probabilistic, we need to assume that each sample x is drawn from a statistical multivariate distribution D, commonly known as a data generating process (its probability density function is often denoted as pdata(x)). For our purposes, it's also useful to add a very important condition on the whole dataset X: we expect all samples to be independent and identically distributed (i.i.d.). This means that all variables belong to the same distribution D and, considering an arbitrary subset of k values, the following is true:

pdata(x1, x2, …, xk) = pdata(x1) · pdata(x2) · … · pdata(xk)
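This factorization can be verified numerically; the following is a minimal sketch (assuming, purely for illustration, that the data generating process is a standard normal distribution):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1000)

# Draw k samples; under the i.i.d. assumption, each comes from the same D
k = 5
samples = rng.standard_normal(k)

# Joint density of k independent standard normal components...
joint_density = multivariate_normal(mean=np.zeros(k), cov=np.eye(k)).pdf(samples)

# ...equals the product of the k marginal densities
product_of_marginals = np.prod(norm.pdf(samples))

print(np.isclose(joint_density, product_of_marginals))  # True
```

In practice, sums of log-densities are preferred to products of densities, which underflow very quickly as k grows.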
It's fundamental to understand that all machine learning tasks are based on the assumption of working with well-defined distributions (even if they can be partially unknown), and that the actual datasets are made up of samples drawn from them. In the previous chapter, Chapter 1, A Gentle Introduction to Machine Learning, we defined the concept of learning by considering the interaction between an agent and an unknown situation. This is possible thanks to the ability to learn a representation of the distribution, not of the dataset itself! Hence, from now on, whenever a finite dataset is employed, the reader must always consider the possibility of coping with new samples that share the same distribution.

The corresponding output values can be either numerical-continuous or categorical. In the first case, the process is called regression, while in the second, it is called classification. Examples of numerical outputs are the predicted price of a house, tomorrow's temperature, or the expected revenue of a store.

When the label can assume a finite number of values (for example, it's binary or bipolar), the problem is discrete (also known as categorical, considering that each label is normally associated with a well-defined class or category), while it's called continuous when yi ∈ ℝ.

Other examples of categorical outputs are the class of an email (spam or not spam), the digit represented by a handwritten image, or the species of a plant.

We define a generic regressor, a vector-valued function r(•) that associates an input value with a continuous output, and a generic classifier, a vector-valued function c(•) whose predicted output is categorical (discrete). If they also depend on an internal parameter vector that determines the actual instance of a generic predictor, the approach is called parametric learning:

ỹ = r(x; θ) for regression, and ỹ = c(x; θ) for classification
The vector θ is a summary of all model parameters, which, in general, are the only elements we are going to learn. In fact, the majority of models assume a standard structure that cannot be modified (even if there are some particular dynamic neural networks that allow adding or removing computational units), and their adaptability relies only on the range of possible parameters.
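As a toy sketch of this idea (the linear form and the parameter values below are hypothetical, chosen only for illustration), the structure of a parametric regressor is fixed, and learning only adjusts the vector θ:

```python
import numpy as np

# Hypothetical parametric regressor r(x; θ) = θ[0] + θ[1]·x:
# the functional form is fixed; only the parameter vector θ is learned
def r(x, theta):
    return theta[0] + theta[1] * x

theta = np.array([1.0, 2.0])
print(r(3.0, theta))  # 7.0
```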

On the other hand, non-parametric learning doesn't make initial assumptions about the family of predictors (for example, defining a generic parameterized version of r(•) and c(•)). A very common non-parametric family is called instance-based learning and makes real-time predictions (without pre-computing parameter values) based on a hypothesis that's only determined by the training samples (instance set). A simple and widespread approach adopts the concept of neighborhoods (with a fixed radius). In a classification problem, a new sample is automatically surrounded by classified training elements and the output class is determined considering the preponderant one in the neighborhood. In this book, we're going to talk about another very important algorithm family belonging to this class: kernel-based Support Vector Machines.
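The neighborhood idea can be sketched with scikit-learn's KNeighborsClassifier on synthetic data (the two Gaussian blobs and the choice of k=5 neighbors are arbitrary assumptions made for this example):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1000)

# Two synthetic Gaussian blobs play the role of the instance set
X = np.concatenate([rng.normal(-2.0, 0.5, size=(50, 2)),
                    rng.normal(2.0, 0.5, size=(50, 2))])
Y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

# No parameter vector is pre-computed: fit() simply stores the instances,
# and each prediction uses the preponderant class in the neighborhood
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, Y)

# A new sample close to the second blob is assigned its class
print(knn.predict([[1.8, 2.1]]))  # [1]
```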

The internal dynamics and the interpretation of all elements are peculiar to every single algorithm, and for this reason, we prefer not to talk about thresholds or probabilities and try to work with an abstract definition. A generic parametric training process must find the best parameter vector that minimizes the regression/classification error given a specific training dataset, and it should also generate a predictor that can correctly generalize when unknown samples are provided.

Another interpretation can be expressed in terms of additive noise:

ỹ = r(x; θ) + n
For our purposes, we can expect zero-mean and low-variance Gaussian noise to be added to a perfect prediction. A training task must then increase the signal-to-noise ratio by optimizing the parameters. Of course, whenever such a term doesn't have a null mean (independently of the other X values), it probably means that there's a hidden trend that must be taken into account (maybe a feature that has been prematurely discarded). On the other hand, a high noise variance means that X is extremely corrupted and its measurements are not reliable.
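A short numerical sketch of these checks, assuming a hypothetical perfect linear process corrupted by zero-mean, low-variance Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(1000)

# Hypothetical "perfect" process y = 2x + 1, corrupted by
# zero-mean, low-variance Gaussian noise
x = np.linspace(0.0, 10.0, 200)
y_true = 2.0 * x + 1.0
y = y_true + rng.normal(0.0, 0.25, size=x.shape)

# The residual mean should be close to zero; a clearly non-null mean
# would suggest a hidden trend (for example, a discarded feature)
residuals = y - y_true
print(abs(np.mean(residuals)) < 0.1)   # True
print(np.std(residuals) < 0.5)         # True: the noise variance is low
```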

In unsupervised learning, we normally only have an input set X of m-dimensional vectors, and we define the clustering function cl(•) (with n target clusters) with the following expression:

ỹ = cl(x), where ỹ ∈ {0, 1, …, n-1}
As explained in the previous chapter, Chapter 1, A Gentle Introduction to Machine Learning, a clustering algorithm tries to discover similarities among samples and group them accordingly; therefore, cl(•) will always output a label between 0 and n-1 (or, alternatively, between 1 and n), representing the cluster that best matches the sample x. As x is assumed to be drawn from the same data generating process used during the training phase, we are mathematically authorized to accept the result as reliable within the limits of the accuracy that has been achieved. On the other hand (and this is true in every machine learning problem), if x is drawn from a completely different distribution, any prediction will be indistinguishable from a random one.

This concept is extremely important, and the reader must understand it (together with all the possible implications). Let's suppose that we classify images of Formula 1 cars and military planes with a final accuracy of 95%. This means that only 5% of the photos representing actual cars or planes are misclassified, probably because of the details, the quality of the photos, the shapes of the objects, the presence of noise, and so on. Conversely, if we try to classify photos of SUVs and large cargo planes, all of the results are meaningless (even if they happen to be correct). This happens because the classifier will seldom output a classification probability of 50% (meaning that the uncertainty is maximum), and the final class will always be one of the two. However, the awareness of the classifier is not very different from that of an oracle tossing a coin. Therefore, whenever we need to work with specific samples, we must make sure to train the model with elements drawn from the same distribution. In the previous example, it's possible to retrain the classifier with all types of cars and planes, trying, at the same time, to reach the same original accuracy.
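As a minimal sketch of cl(•) in practice (the choice of K-means, the synthetic dataset, and n=2 clusters are purely illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1000)

# Unlabeled samples drawn from two well-separated Gaussian blobs
X = np.concatenate([rng.normal(-3.0, 0.5, size=(100, 2)),
                    rng.normal(3.0, 0.5, size=(100, 2))])

# cl(x) outputs a label between 0 and n-1 (here, n=2)
km = KMeans(n_clusters=2, n_init=10, random_state=1000)
labels = km.fit_predict(X)

print(sorted(set(labels.tolist())))  # [0, 1]
# Samples from different blobs end up in different clusters
print(labels[0] != labels[150])      # True
```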

In most scikit-learn models, there is a coef_ instance variable that contains all of the trained parameters. For example, in a single-parameter linear regression (which we're going to discuss widely in the following chapters), the output is as follows:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, Y)
print(model.coef_)
array([ 9.10210898])