Principal Component Analysis

In many cases, the dimensionality of the input dataset X is high and so is the complexity of every related machine learning algorithm. Moreover, the information is seldom spread uniformly across all the features and, as discussed in the previous chapter, Chapter 2, Important Elements in Machine Learning, there will be high-entropy features together with low-entropy ones, which, of course, don't contribute significantly to the final outcome. This concept can also be expressed by considering a fundamental assumption of semi-supervised learning, called the manifold assumption. It states (without a formal proof, as it's an empirical hypothesis) that data with high dimensionality normally lies on lower-dimensional manifolds. If the reader is not familiar with the concept of a manifold, it's not necessary for our purposes to provide a complete rigorous definition. It's enough to say that a manifold is a non-Euclidean space that locally behaves like a Euclidean one. The simplest example is a spherical surface: picking a point, it's possible to find a small neighborhood around it whose curvature is negligible, so that it resembles a patch of a bidimensional Euclidean space.

In many machine learning and data science tasks, n-dimensional vectors seldom spread uniformly over all of ℝⁿ. This means that there are dimensions whose associated information is almost null. However, this is not an error or a bad condition because, in many cases, it's simpler to add dimensions to better summarize the data. On the other hand, an algorithm is agnostic to this kind of descriptive choice and, in many cases, it performs better when the dimensionality is reduced to its minimum value (cf. the Hughes phenomenon).

In general, if we consider a Euclidean space, we get the following:

x ∈ ℝᵐ ⟹ x = x1·e1 + x2·e2 + … + xm·em, where {e1, e2, …, em} is an orthonormal basis of ℝᵐ

So, each point is expressed by using an orthonormal basis made of m linearly independent vectors (in general, this is the canonical basis associated with ℝᵐ). Now, considering a dataset X, a natural question arises: how is it possible to reduce m without a drastic loss of information?

Let's consider the following graph (without any particular interpretation):

Sample bidimensional dataset with a larger horizontal variance

It doesn't matter which distributions generated X = (x, y); however, the variance of the horizontal component is clearly higher than that of the vertical one. As discussed previously, this means that the amount of information provided by the first component is higher than that of the second: if, for example, the x axis is stretched horizontally while keeping the vertical one fixed, the distribution becomes similar to a segment where the depth has lower and lower importance.
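
As a minimal sketch of this situation (the distributions behind the previous graph are not specified, so the Gaussian parameters and the variable names below are arbitrary choices), we can generate a bidimensional dataset with a much larger horizontal variance and check the two sample variances:

import numpy as np

# Bidimensional Gaussian sample: the horizontal standard deviation (4.0) is
# four times larger than the vertical one (1.0)
rs = np.random.RandomState(1000)
X = rs.normal(loc=0.0, scale=(4.0, 1.0), size=(500, 2))

# The first variance is roughly 16 times larger than the second one
print(X.var(axis=0))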

In order to assess how much information is carried by each component, and the correlation among them, a useful tool is the covariance matrix (if the dataset has zero mean, we can use the correlation matrix):

C = (1/n) Σ (xi − μ)(xi − μ)ᵀ, where the sum runs over the n samples xi, μ is the sample mean, and C ∈ ℝᵐˣᵐ

C is symmetric and positive semidefinite, so all the eigenvalues are non-negative, but what's the meaning of each value? On the diagonal, there are the variances σi² of each component, which encode the spread around the mean. Larger variances imply more spread-out components, while σi² = 0 is equivalent to a situation where all the values are equal to the mean. The off-diagonal elements σij are the cross-covariances (or, once normalized, cross-correlations) between pairs of components, and their values are determined by the linear dependencies. Through an affine transform, it's possible to find a new basis where all these elements are null, as shown in the following graph:

Dataset with non-null cross-covariances (left), and the rotated (with null-covariance) version (right)

In the first case, the dataset is rotated; therefore, the components x1 and x2 become dependent on each other with respect to the canonical basis. Applying the affine transform and obtaining a new basis, the components become uncorrelated. Considering the previous example (without rotations), the covariance matrix is as follows:
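
The exact values depend on the distributions used to generate the figure, which are not available here; however, the covariance matrix of the hypothetical dataset sketched earlier has the same structure (large horizontal variance, small vertical variance, independent components) and can be computed directly with NumPy:

import numpy as np

# rowvar=False because the features are stored as columns of X
C = np.cov(X, rowvar=False)

# The diagonal contains the two variances, while the off-diagonal elements
# are the cross-covariances (close to zero, as the components are independent)
print(C)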

As expected, the horizontal variance is quite a bit higher than the vertical one, while the other values are close to zero. If you remember the definition and, for simplicity, remove the mean term, these values represent the cross-covariances between pairs of components. It's obvious that, in our example, X and Y are uncorrelated (they're orthogonal), but in real-life examples, there could be features that present a residual cross-correlation. In terms of information theory, this means that knowing Y gives us some information about X that we already have, so the two features partially duplicate the same information. Our goal, therefore, is to also decorrelate X while trying to reduce its dimensionality. The magnitude of the eigenvalues of C is proportional to the amount of information carried by the referenced component, and the associated eigenvector provides us with the direction of such a component. Hence, let's suppose that we compute the sorted set of eigenvalues Λ:

Λ = {λ1, λ2, …, λm}, with λ1 ≥ λ2 ≥ … ≥ λm
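
Continuing the NumPy sketch (the names eigenvalues and eigenvectors are arbitrary), the sorted set can be obtained from the covariance matrix computed earlier; np.linalg.eigh exploits the symmetry of C:

import numpy as np

# eigh returns the eigenvalues of the symmetric matrix C in ascending order,
# so both arrays are reversed to obtain Λ with the largest eigenvalue first
eigenvalues, eigenvectors = np.linalg.eigh(C)
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print(eigenvalues)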

If we select g < m values, the following relation holds for the relative subspace Λg:

Λg = {λ1, λ2, …, λg}, where the ratio (λ1 + λ2 + … + λg) / (λ1 + λ2 + … + λm) measures the fraction of the total variance retained by the subspace (and should remain close to 1)

At this point, we can build a transformation matrix based on the first g eigenvectors:

W = (w1, w2, …, wg) ∈ ℝᵐˣᵍ, where wi is the (column) eigenvector associated with λi

As X ∈ ℝⁿˣᵐ, we obtain the projections with the matrix multiplication:

Y = XW ∈ ℝⁿˣᵍ
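
Continuing the same sketch (g = 1 is an arbitrary choice for the bidimensional example):

# Keep only the eigenvectors associated with the top g eigenvalues and
# project the dataset onto the reduced subspace. X was generated with zero
# mean; in general, the dataset must be centered before projecting.
g = 1
W = eigenvectors[:, :g]
Y = X.dot(W)

print(Y.shape)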

So, it's possible to project the original feature vectors into this new (sub)space, where each component carries a portion of the total variance and where the new covariance matrix is decorrelated, reducing the useless information sharing (in terms of correlation) among different features. In other words, as we are considering the eigenvectors associated with the top g eigenvalues, the transformed dataset will be rotated so that the columns of W become the new canonical basis and all the off-diagonal elements of the new covariance matrix are forced to zero. Before showing a practical example, it's useful to briefly discuss how to perform the eigendecomposition of C. Clearly, this operation can be done directly, but it's simpler to employ the SVD. For a generic matrix A ∈ ℝᵐˣⁿ (the same result is valid for complex matrices, but in our case, we are going to use only real-valued ones), the SVD is a factorization that expresses A in the following way:

A = UΣVᵀ, where U ∈ ℝᵐˣᵐ and V ∈ ℝⁿˣⁿ are orthogonal matrices and Σ ∈ ℝᵐˣⁿ is a rectangular diagonal matrix
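
Before describing the factors, here is a quick numerical check on the small dataset used in the previous sketches (the division by n - 1 is needed because np.cov computes the unbiased estimate of the covariance matrix):

import numpy as np

# Center the dataset (np.cov subtracts the sample mean as well) and apply the
# SVD: the squared singular values divided by n - 1 give the eigenvalues of C,
# and the rows of Vt are its eigenvectors (up to sign)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print((s ** 2) / (Xc.shape[0] - 1))
print(Vt)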

We are going to employ this transformation in other contexts, but for now, it's enough to say that the elements of Σ (called singular values) are the square roots of the eigenvalues of both AAᵀ and AᵀA, and that the columns of V (called right singular vectors) are the eigenvectors of AᵀA. Therefore, we don't need to compute the covariance matrix explicitly. If we apply the SVD to the dataset X, we can immediately obtain the transformation matrix. The scikit-learn library provides an implementation called TruncatedSVD, which performs the SVD limited to the largest singular values, and this is the most efficient way to perform a PCA. However, for simplicity, there's also the PCA class, which can do all of this in a very smooth way (without the need for any further operations):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

A screenshot with a few random MNIST handwritten digits follows:

Examples of MNIST digits

Each image is a vector of 64 grayscale values (the scikit-learn digits dataset stores 8 x 8 images whose integer pixel values lie in the range (0, 16)), so the initial number of components is indeed 64. However, the total amount of black pixels is often predominant and the basic strokes needed to write 10 digits are similar, so it's reasonable to assume both a high cross-correlation and a low variance on several components. Trying with 36 principal components, we get the following:

pca = PCA(n_components=36, whiten=True)
# The pixel values range from 0 to 16, so dividing by 16 normalizes them into [0, 1]
X_pca = pca.fit_transform(digits.data / 16.0)

In order to improve performance, all integer values are normalized into the range [0, 1] and, through the whiten=True parameter, the variance of each component is scaled to one (the covariance matrix is already decorrelated). As the official scikit-learn documentation says, this process is particularly useful when an isotropic distribution is needed by many algorithms in order to perform efficiently. It's possible to access the explained variance ratio through the explained_variance_ratio_ instance variable, which shows which part of the total variance is carried by every single component:

print(pca.explained_variance_ratio_)
array([ 0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415,
0.0491691 , 0.04315987, 0.03661373, 0.03353248, 0.03078806,
0.02372341, 0.02272697, 0.01821863, 0.01773855, 0.01467101,
0.01409716, 0.01318589, 0.01248138, 0.01017718, 0.00905617,
0.00889538, 0.00797123, 0.00767493, 0.00722904, 0.00695889,
0.00596081, 0.00575615, 0.00515158, 0.00489539, 0.00428887,
0.00373606, 0.00353274, 0.00336684, 0.00328029, 0.0030832 ,
0.00293778])

A plot for this example is shown next. The left-hand graph represents the explained variance ratio of each component, while the right-hand one is the cumulative variance. It can be immediately seen how the first components are normally the most important ones in terms of information, while the following ones provide details that a classifier could also discard:

Explained variance per component (left), and the cumulative variance (right)
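
The two plots can be reproduced with Matplotlib (a minimal sketch; the figure size and the bar style are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

ev = pca.explained_variance_ratio_

# Left: explained variance ratio per component; right: its cumulative sum
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].bar(np.arange(len(ev)), ev)
ax[0].set_xlabel('Component')
ax[0].set_ylabel('Explained variance ratio')
ax[1].plot(np.cumsum(ev))
ax[1].set_xlabel('Component')
ax[1].set_ylabel('Cumulative explained variance')
plt.show()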

As expected, the contribution to the total variance decreases dramatically starting from the fifth component, so it's possible to reduce the original dimensionality without an unacceptable loss of information, which could drive an algorithm to learn incorrect classes. We can now rebuild the same handwritten digits by using only the first 36 components, with whitening and normalization between 0 and 1. To obtain approximations of the original images, we need to inverse-transform all the new vectors, projecting them back into the original space:

# Project the reduced vectors back into the original 64-dimensional space
X_rebuilt = pca.inverse_transform(X_pca)

The result is shown in the following screenshot:

MNIST digits obtained by rebuilding them from the principal components

This process can also partially denoise the original images by removing residual variance, which is often associated with noise or unwanted contributions (almost every handwriting style distorts some of the structural elements that are used for recognition).
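
A comparison similar to the previous screenshot can be reproduced by reshaping the original and rebuilt vectors back into 8 x 8 images (a sketch; the number of displayed digits and the colormap are arbitrary choices):

import matplotlib.pyplot as plt

# First row: original digits; second row: reconstructions from 36 components
fig, ax = plt.subplots(2, 10, figsize=(12, 3))
for i in range(10):
    ax[0, i].imshow(digits.data[i].reshape(8, 8), cmap='gray')
    ax[0, i].axis('off')
    ax[1, i].imshow(X_rebuilt[i].reshape(8, 8), cmap='gray')
    ax[1, i].axis('off')
plt.show()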

I suggest that the reader try different numbers of components (using the explained variance data) and also n_components='mle', which implements an automatic selection of the best dimensionality using a technique based on Bayesian model selection (for further information, please refer to Automatic Choice of Dimensionality for PCA, Minka T. P., NIPS 2000: 598-604, 2000).
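
A quick experiment could look like the following sketch (the variable names are hypothetical; n_components='mle' requires the full SVD solver and more samples than features, which is the case for the digits dataset):

# Let PCA choose the number of components through Minka's MLE criterion
pca_mle = PCA(n_components='mle', svd_solver='full')
X_mle = pca_mle.fit_transform(digits.data / 16.0)

print(X_mle.shape[1])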

As explained, scikit-learn solves the PCA problem with the SVD. It's possible to control the algorithm through the svd_solver parameter, whose values are 'auto', 'full', 'arpack', and 'randomized'. ARPACK (ARnoldi PACKage) implements a truncated SVD, while randomized is based on an approximate algorithm that drops many singular vectors and can achieve very good performance with high-dimensional datasets where the actual number of components is considerably smaller.
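
As a brief illustration (a sketch; the choice of 36 components and the random_state value are arbitrary, and remember that TruncatedSVD, unlike PCA, does not center the data before the decomposition):

from sklearn.decomposition import TruncatedSVD

# Randomized solver used directly inside PCA
pca_rand = PCA(n_components=36, svd_solver='randomized', random_state=1000)
X_rand = pca_rand.fit_transform(digits.data / 16.0)

# Explicit truncated SVD limited to the top 36 singular values
tsvd = TruncatedSVD(n_components=36, random_state=1000)
X_tsvd = tsvd.fit_transform(digits.data / 16.0)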