Unsupervised learning
This approach is based on the absence of any supervisor, and therefore of absolute error measures. It's useful when it's necessary to learn how a set of elements can be grouped (clustered) according to their similarity (or a distance measure). For example, looking at the previous graph, a human being can immediately identify two sets without considering the colors or the shapes. In fact, the circular dots (as well as the triangular ones) form a coherent set; it is separated from the other one much more than its points are internally separated. Using a metaphor, an ideal scenario is a sea with a few islands that can be distinguished from each other by considering only their mutual position and internal cohesion. Clearly, unsupervised learning provides an implicit descriptive analysis, because all the pieces of information discovered by the clustering algorithm can be used to obtain complete insight into the dataset. In fact, all objects share a subset of features, while they differ under other viewpoints. The aggregation process is also aimed at extending the characteristics of some points to their neighbors, assuming that the similarity is not limited to some specific features. For example, in a recommendation engine, a group of users can be clustered according to the preferences expressed for some books. If the chosen criterion detects some analogies between users A and B, we can share the non-overlapping elements between them. Therefore, if A has read a book that could be suitable for B, we are implicitly authorized to recommend it. In this case, the decision is made by considering a goal (sharing the features) and a descriptive analysis. However, as the model can (and should) manage unknown users too, its purpose is also predictive.
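The following is a minimal sketch of this recommendation idea, using scikit-learn's KMeans on a hypothetical binary preference matrix; the data, the two-cluster choice, and the recommendation rule are all illustrative assumptions, not a definitive implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical binary preference matrix: rows are users, columns are books,
# and 1 means the user has read (or liked) the book. All values are made up.
X = np.array([
    [1, 1, 0, 0, 1],   # user A
    [1, 1, 1, 0, 0],   # user B
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1],
])

# Group the users into two clusters according to their preferences
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recommend to each user the books read by peers in the same cluster
# that the user hasn't read yet (the non-overlapping elements)
for i, label in enumerate(kmeans.labels_):
    peers = X[kmeans.labels_ == label]
    candidates = np.where((peers.max(axis=0) == 1) & (X[i] == 0))[0]
    print(f"User {i} (cluster {label}): recommend books {list(candidates)}")
```

With this toy matrix, users A and B end up in the same cluster, so the book read only by B becomes a candidate recommendation for A, mirroring the reasoning above.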
In the following graph, each ellipse represents a cluster, and all the points inside its area can be labeled in the same way. There are also boundary points (such as the triangles overlapping the circle area) that need a specific criterion (normally a trade-off distance measure) to determine the corresponding cluster. Just as for classification with ambiguities (the P and the malformed R), a good clustering approach should consider the presence of outliers and treat them so as to increase both the internal coherence (visually, this means picking a subdivision that maximizes the local density) and the separation among clusters.
For example, it's possible to give priority to the distance between a single point and a centroid, or to the average distance among points belonging to the same cluster versus different ones. In this graph, all the boundary triangles are close to each other, so the nearest neighbor is another triangle. However, in real-life problems, there are often boundary areas with a partial overlap, meaning that some points have a high degree of uncertainty due to their feature values.
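To make the trade-off concrete, here is a minimal sketch that assigns a hypothetical boundary point using either criterion; the toy coordinates and the assign_boundary_point helper are illustrative assumptions:

```python
import numpy as np

def assign_boundary_point(p, clusters, mode="centroid"):
    """Assign point p to a cluster using one of two criteria:
    distance to the centroid, or average distance to the members."""
    best_label, best_score = None, np.inf
    for label, members in clusters.items():
        members = np.asarray(members)
        if mode == "centroid":
            # Distance between the point and the cluster centroid
            score = np.linalg.norm(p - members.mean(axis=0))
        else:
            # Average distance to all the members of the cluster
            score = np.mean(np.linalg.norm(members - p, axis=1))
        if score < best_score:
            best_label, best_score = label, score
    return best_label

# Two toy clusters and a point lying between them (made-up values)
clusters = {
    "circles":   [[0.0, 0.0], [1.0, 0.2], [0.5, 0.8]],
    "triangles": [[4.0, 4.0], [4.5, 3.6], [3.8, 4.4]],
}
p = np.array([2.4, 2.0])
print(assign_boundary_point(p, clusters, mode="centroid"))
print(assign_boundary_point(p, clusters, mode="average"))
```

The two criteria often agree, but for elongated or unevenly dense clusters they can disagree, which is exactly when the trade-off matters.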
Another interpretation can be expressed by using probability distributions. If you look at the ellipses, they represent the area of multivariate Gaussians bound between a minimum and maximum variance. Considering the whole domain, a point (for example, a blue star) could potentially belong to all clusters, but the probability given by the first one (in the lower-left corner) is the highest, so this determines the membership. Once the variance and mean (in other words, the shape) of all the Gaussians become stable, each boundary point is automatically captured by a single Gaussian distribution (except in the case of equal probabilities). Technically, we say that such an approach maximizes the likelihood of a Gaussian mixture given a certain dataset. This is a very important statistical learning concept that spans many different applications, so it will be examined in more depth in the next chapter, Chapter 2, Important Elements in Machine Learning. Moreover, we're going to discuss some common clustering methodologies, considering both their strong and weak points, and compare their performance for various test distributions.
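The following is a minimal sketch of this mechanism, using scikit-learn's GaussianMixture on two synthetic blobs; the data and the boundary point are assumptions, but the posterior probabilities returned by predict_proba show how the membership is determined:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two synthetic Gaussian blobs (assumed data, standing in for the ellipses)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.6, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.6, size=(100, 2)),
])

# Fit a two-component Gaussian mixture by maximizing the likelihood
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# A point between the blobs formally belongs to every component;
# the posterior probabilities decide the membership
point = np.array([[1.5, 1.5]])
print(gm.predict_proba(point))  # posterior probability of each component
print(gm.predict(point))        # component with the highest probability
```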
Other important techniques involve the use of both labeled and unlabeled data. This approach is therefore called semi-supervised, and it can be adopted when it's necessary to categorize a large amount of data with only a few complete (labeled) examples, or when there's a need to impose some constraints on a clustering algorithm (for example, assigning some elements to a specific cluster or excluding others).
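A minimal sketch of this scenario can be built with scikit-learn's LabelSpreading; the synthetic dataset, the choice of only four labeled points, and the kNN kernel are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Synthetic dataset: 200 points in two blobs, of which only four keep their
# label; -1 marks unlabeled samples (scikit-learn's convention)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
y_partial = np.full_like(y, -1)
labeled_idx = np.concatenate([np.where(y == c)[0][:2] for c in (0, 1)])
y_partial[labeled_idx] = y[labeled_idx]

# Propagate the few known labels to the unlabeled neighbors
model = LabelSpreading(kernel='knn', n_neighbors=10).fit(X, y_partial)
print((model.transduction_ == y).mean())  # fraction labeled correctly
```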
Common unsupervised applications include the following:
- Object segmentation (for example, users, products, movies, songs, and so on)
- Similarity detection
- Automatic labeling
- Recommendation engines