- Machine Learning Algorithms (Second Edition)
- Giuseppe Bonaccorso
Atom extraction and dictionary learning
Dictionary learning is a technique that allows you to rebuild a sample starting from a sparse dictionary of atoms (similar to principal components, but without any constraint on their independence). Conventionally, when the dictionary contains fewer elements than the dimensionality m of the samples, it is called under-complete; on the other hand, it is called over-complete when the number of atoms is larger (sometimes much larger) than m.
In Online Dictionary Learning for Sparse Coding, Mairal J., Bach F., Ponce J., Sapiro G., Proceedings of the 26th International Conference on Machine Learning, 2009, there's a description of the online strategy adopted by scikit-learn, which can be summarized as a double optimization problem.
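scikit-learn exposes this online variant through the MiniBatchDictionaryLearning class; the following is only a minimal usage sketch (the digits dataset, the number of atoms, and the batch size are choices made here for illustration, not values prescribed by the book):

from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

# Online (mini-batch) dictionary learning on the 8x8 digits dataset
digits = load_digits()
mbdl = MiniBatchDictionaryLearning(n_components=36, batch_size=200, random_state=1000)
X_code = mbdl.fit_transform(digits.data)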
Let's suppose that we have a dataset, X, made up of n samples:

X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\} \quad \text{where} \; \bar{x}_i \in \mathbb{R}^m
Our goal is to find both a dictionary D (that is, a matrix containing m k-dimensional vectors) and a set of vectorial weights, one for each sample:

\bar{\alpha}_i \in \mathbb{R}^k \quad \text{for} \; i = 1, 2, \ldots, n

After the training process, an input vector x_i can be computed as follows:

\bar{x}_i = D\bar{\alpha}_i
The reader can recognize an approach which is very similar to what we discussed for PCA. In fact, each sample x_i is re-expressed using a transformation matrix and a projection α_i.
The optimization problem (which involves both D and the α vectors) can be expressed as the minimization of the following loss function:

\min_{D, \bar{\alpha}_i} \; \sum_{i=1}^{n} \left( \frac{1}{2} \left\| \bar{x}_i - D\bar{\alpha}_i \right\|_2^2 + c \left\| \bar{\alpha}_i \right\|_1 \right)
Here, the parameter c controls the level of sparsity (which is proportional to the strength of the L1 normalization). This problem, analogously to NNMF, can be solved by alternating the optimization of the two variables (updating one while keeping the other fixed, in a least-squares fashion) until a stable point is reached. Dictionary learning can be used to perform a Sparse PCA but, in some cases, it can be helpful to consider over-complete dictionaries with more complex datasets. For example, a set of pictures representing cars can be decomposed into all of its possible components, considering that a single sample must be rebuilt using a large number of atoms (even k >> m). In fact, when k << m, the components are forced to represent more complex parts, while k >> m allows extracting all the basic constituent elements.
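To clarify the alternating scheme, here is a purely illustrative NumPy/scikit-learn sketch (not the algorithm used internally by scikit-learn): a Lasso step computes the α vectors with the dictionary kept fixed, and a least-squares step updates D with the codes kept fixed. The constants k, c, and n_iter, together with all the variable names, are arbitrary choices:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(1000)
X = rng.randn(100, 20)            # n=100 toy samples, m=20 dimensions
k, c, n_iter = 10, 0.1, 20        # number of atoms, sparsity weight, iterations

D = rng.randn(20, k)              # dictionary D: m x k
A = np.zeros((100, k))            # codes: one k-dimensional alpha per sample

for _ in range(n_iter):
    # alpha-step: sparse coding of each sample with the dictionary kept fixed
    lasso = Lasso(alpha=c, fit_intercept=False, max_iter=5000)
    for i in range(X.shape[0]):
        A[i] = lasso.fit(D, X[i]).coef_
    # D-step: least-squares update of the dictionary with the codes kept fixed
    D = np.linalg.lstsq(A, X, rcond=None)[0].T
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # normalize the atoms

X_rec = A @ D.T                   # approximate reconstruction of the samples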
A very important element to consider is the L1-normalization. Unfortunately, it's rather difficult to prove formally why this term forces sparseness (we are going to show some other examples in the next chapter); however, a very simple explanation can be obtained by considering the L0-norm, which (rather informally) is the number of non-null elements of a vector. Minimizing the L0-norm means directly minimizing the number of non-null components, forcing as many of them as possible to become exactly zero. Unfortunately, this norm is not differentiable and cannot be used in a standard optimization process. The L1-norm is hence the best candidate because, among the Lp-norms, the amount of induced sparseness decreases as p increases. This is not a rigorous proof, but it can be useful to understand the logic of this approach. In many other algorithms, the presence of an L1 penalty indicates that a specific quantity is desired to be sparse.
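The sparsifying effect of the L1 penalty can also be observed empirically. The following snippet is only illustrative (not taken from the book): it fits the same synthetic regression problem with an L1 (Lasso) and an L2 (Ridge) penalty and counts the non-null coefficients:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression where only the first 3 of 30 features are informative
rng = np.random.RandomState(1000)
X = rng.randn(50, 30)
y = X[:, :3] @ np.array([1.5, -2.0, 1.0]) + 0.01 * rng.randn(50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# The L1 penalty drives most coefficients exactly to zero; the L2 penalty does not
print('Non-null Lasso coefficients:', np.sum(np.abs(lasso.coef_) > 1e-6))
print('Non-null Ridge coefficients:', np.sum(np.abs(ridge.coef_) > 1e-6))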
In scikit-learn, we can implement such an algorithm with the DictionaryLearning class (using the usual MNIST dataset), where n_components, as usual, determines the number of atoms:
from sklearn.datasets import load_digits
from sklearn.decomposition import DictionaryLearning

digits = load_digits()  # 1,797 8x8 grayscale digit images (64-dimensional samples)
dl = DictionaryLearning(n_components=36, fit_algorithm='lars', transform_algorithm='lasso_lars')
X_dict = dl.fit_transform(digits.data)
A plot of each atom (component) is shown in the following screenshot:
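Such a plot can be reproduced with a minimal matplotlib sketch like the following (the 6 x 6 grid simply matches n_components=36, and the gray colormap is an arbitrary choice):

import matplotlib.pyplot as plt

# Each row of components_ is an atom; reshape it back into an 8x8 image
fig, axes = plt.subplots(6, 6, figsize=(8, 8))
for atom, ax in zip(dl.components_, axes.ravel()):
    ax.imshow(atom.reshape(8, 8), cmap='gray')
    ax.axis('off')
plt.show()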
In this example, the number of components is smaller than the dimensionality, and the atoms contain more complex shapes. I invite the reader to repeat this exercise with different values, observing the resultant plot and comparing the granularity of the features.
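As a possible starting point for that exercise, an over-complete dictionary can be obtained by choosing more atoms than input dimensions (the value 100 below is an arbitrary choice):

# Over-complete dictionary: more atoms (100) than input dimensions (64)
dl_over = DictionaryLearning(n_components=100, fit_algorithm='lars', transform_algorithm='lasso_lars')
X_dict_over = dl_over.fit_transform(digits.data)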