Lasso

Lasso regularization is based on the L1-norm of the parameter vector:

Contrary to Ridge, which shrinks all the weights, Lasso can shift the smallest one to zero, creating a sparse parameter vector. The mathematical proof is beyond the scope of this book; however, it's possible to understand it intuitively by considering the following diagram (bidimensional):

Lasso (L1) regularization

The zero-centered square represents the Lasso boundaries. If we consider a generic line, the probability of being tangential to the square is higher at the corners, where at least one (exactly one in a bidimensional scenario) parameter is null. In general, if we have a vectorial convex function f(x) (we provide a definition of convexity in Chapter 5, EM Algorithm and Applications), we can define:

As any Lp-norm is convex, as well as the sum of convex functions, g(x) is also convex. The regularization term is always non-negative, therefore the minimum corresponds to the norm of the null vector. When minimizing g(x), we need to also consider the contribution of the gradient of the norm in the ball centered in the origin where, however, the partial derivatives don't exist. Increasing the value of p, the norm becomes smoothed around the origin, and the partial derivatives approach zero for |xi| → 0.

On the other side, with p=1 (excluding the L0-norm and all the norms with p ∈ ]0, 1[ that allow an even stronger sparsity, but are non-convex), the partial derivatives are always +1 or -1, according to the sign of xi (xi ≠ 0). Therefore, it's easier for the L1-norm to push the smallest components to zero, because the contribution to the minimization (for example, with a gradient descent) is independent of xiwhile an L2-norm decreases its speed when approaching the origin. This is a non-rigorous explanation of the sparsity achieved using the L1-norm. In fact, we also need to consider the term f(x), which bounds the value of the global minimum; however, it may help the reader to develop an intuitive understanding of the concept. It's possible to find further and mathematically rigorous details in Optimization for Machine Learning, (edited by) Sra S., Nowozin S., Wright S. J., The MIT Press.

Lasso regularization is particularly useful whenever a sparse representation of a dataset is needed. For example, we could be interested in finding the feature vectors corresponding to a group of images. As we expect to have many features but only a subset present in each image, applying the Lasso regularization allows forcing all the smallest coefficients to become null, suppressing the presence of the secondary features. Another potential application is latent semantic analysis, where our goal is to describe the documents belonging to a corpus in terms of a limited number of topics. All these methods can be summarized in a technique called sparse coding, where the objective is to reduce the dimensionality of a dataset (also in non-linear scenarios) by extracting the most representative atoms, using different approaches to achieve sparsity.