Kernel PCA

We're going to discuss kernel methods in Chapter 7, Support Vector Machines, however, it's useful to mention the KernelPCA class, which performs a PCA with non-linearly separable datasets. This approach is analogous to a standard PCA with a particular preprocessing step. Contrary to what many people can expect, a non-linear low-dimensional dataset can often become linearly separable when projected onto special higher-dimensional spaces. On the other hand, we prefer not to introduce a major complexity that could even result in an unsolvable problem. The kernel trick can help us in achieving this goal without the burden of hard, non-linear operations. The complete mathematical proofs are beyond the scope of this book, however, we can define a kernel as a real-valued vectorial function that has the following property:

In particular, if xi ∈ ?n, the vectorial transformation Ψ(xi) projects the original vector onto a larger subspace. In fact, its codomain is normally ?p with p > n. The advantage of the kernel K(xi, xj) derives from the fact that computing it with two values, xi and xj, its output corresponds to the dot product of the transformed vectors, therefore no additional computations are required. Now, let's consider the standard PCA transformation for a single sample (xi ∈ ?n × 1):

When using a Kernel PCA, we need to re-express the basis matrix W using the projection Wg = Ψ(x)vg, where vg are the eigenvectors computed considering the kernel transformation K instead of the covariance matrix C (remember that the dot product of a scalar is equivalent to a standard multiplication) and Ψ(x) is a matrix whose rows are Ψ(xi) for all xi. Therefore, the Kernel PCA transformation becomes the following:

However, the dot products can be efficiently computed using the kernel trick, and so the final computation corresponds to the extraction of the components of the dataset projected on the higher-dimensional subspace (where it's likely to be linearly separable). In other words, we are assuming that in a different (high-dimensional) space, there are components whose explained variance is negligible with respect to other ones. To understand this concept, let's consider a dataset made up of a circle with a blob inside it:

from sklearn.datasets import make_circles

Xb, Yb = make_circles(n_samples=500, factor=0.1, noise=0.05)

The graphical representation is shown in the following graph. In this case, a classic PCA approach isn't able to capture the non-linear dependency of existing components (the reader can verify that the projection is equivalent to the original dataset). However, looking at the samples and using polar coordinates, it's easy to separate the two sets, only considering the radius (in this case, the explained variance is no more a function of both components because the radius is almost constant):

A non-linearly separable dataset containing a dense blob surrounded by a circular set

Considering the structure of the dataset, it's possible to investigate the behavior of a PCA with a radial basis function kernel (which is sensitive to the distance from the origin):

As the default value for gamma is 1.0/number of features (for now, consider that this parameter as inversely proportional to the variance of a Gaussian σ2), we need to increase it to capture the external circle. A value of 1.0 is enough:

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', fit_inverse_transform=True, gamma=1.0)
X_kpca = kpca.fit_transform(Xb)

The X_transformed_fit_ instance variable will contain the projection of our dataset into the new space. Plotting it, we get the following:

Kernel PCA graph where the left-hand strip represents the outer circle, while the smaller half-moon represents the inner blob

The plot shows a separation just like expected, and it's also possible to see that the points belonging to the central blob have a curve distribution because they are more sensitive to the distance from the center.

The Kernel PCA is a powerful instrument when we think of our dataset as being made up of elements that can be a function of components (in particular, radial-basis or polynomials), but we aren't able to determine a linear relationship among them.

For more information about the different kernels supported by scikit-learn, visit  http://scikit-learn.org/stable/modules/metrics.html#linear-kernel.