Explained variance

In a linear regression problem (as well as in Principal Component Analysis (PCA)), it's helpful to know how much of the original variance can be explained by the model. This concept helps us understand the amount of information that we lose by approximating the dataset. When this value is small, the data-generating process has strong oscillations and a linear model fails to capture them. A very simple but effective measure (not very different from R²) is defined as follows, where Ŷ denotes the prediction of the model:

$$EV = 1 - \frac{Var[Y - \hat{Y}]}{Var[Y]}$$

When Y is well approximated, the numerator is close to 0 and EV → 1, which is the optimal value. In all other cases, the fraction is the ratio between the variance of the errors and the variance of the original process, and the index is its complement to 1. We can compute this score for our example using the same CV strategy employed previously:

print(cross_val_score(lr, X, Y, cv=10, scoring='explained_variance').mean())
0.271
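
To make the definition concrete, the following is a minimal sketch that computes the score by hand and compares it with scikit-learn's `explained_variance_score` and `r2_score`. The data are synthetic and purely illustrative (they are not the dataset used in this section), and the helper `explained_variance` is a hypothetical name introduced only for this example. The predictions include a constant offset on purpose: the variance of the errors ignores such a bias, while R² penalizes it, which is the main reason why the two measures can differ.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score


def explained_variance(y_true, y_pred):
    """Hypothetical helper: EV = 1 - Var[y_true - y_pred] / Var[y_true]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)


# Synthetic target (assumption: illustrative data, not the section's dataset)
rng = np.random.RandomState(1000)
y_true = rng.normal(0.0, 2.0, size=500)

# Predictions affected by small random errors plus a constant bias of 1.0
y_pred = y_true + rng.normal(0.0, 0.5, size=500) + 1.0

ev_manual = explained_variance(y_true, y_pred)
ev_sklearn = explained_variance_score(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print('EV (manual):  %.3f' % ev_manual)
print('EV (sklearn): %.3f' % ev_sklearn)
print('R2:           %.3f' % r2)
```

The two explained variance values should coincide, while R² should be lower because of the constant offset. When the errors have zero mean, the two measures are equal.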

The value, similarly to R², is not acceptable, even if it's slightly higher. It's clear that the dynamics of the dataset cannot be modeled using a linear system, because only a few features show a behavior that can be represented as a generic line with additive Gaussian noise. In general, when R² is unacceptable, it doesn't make sense to compute other measures, because the accuracy of the model will normally be low. Instead, it's preferable to analyze the data to gain a better understanding of the individual time series. In the Boston dataset, many values show an extremely non-linear behavior, and the most regular features don't seem to be stationary. This means that there are (often unknown) events that dramatically alter the dynamics. If the time window is short, it's possible to imagine the presence of oscillations and seasonalities, but when the samples are collected over a sufficiently long period, it's more reasonable to suppose the presence of factors that have not been included in the model. For example, house prices can change dramatically after a natural catastrophe, and such a situation is clearly unpredictable. When the number of irregular samples is small, it's possible to treat them as outliers and filter them out with dedicated algorithms, but when they are more frequent, it's better to employ a model that can learn non-linear dynamics. In this chapter, and in the rest of the book, we are going to discuss methods that can solve or mitigate these problems.
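
As a rough illustration of the last point, the sketch below compares a plain linear regression with a model that can represent a simple non-linearity, using the same `explained_variance` scoring and a 10-fold CV. Everything here is an assumption made for the example: the data are synthetic (a noisy parabola, not the Boston dataset), and the quadratic pipeline is just one possible non-linear model, not necessarily the one adopted later in the book.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear process (assumption: y = x^2 plus Gaussian noise)
rng = np.random.RandomState(1000)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
Y = X[:, 0] ** 2 + rng.normal(0.0, 0.5, size=300)

# Shuffled 10-fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=1000)

# A straight line versus a degree-2 polynomial fitted by least squares
linear = LinearRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

ev_linear = cross_val_score(
    linear, X, Y, cv=cv, scoring='explained_variance').mean()
ev_quadratic = cross_val_score(
    quadratic, X, Y, cv=cv, scoring='explained_variance').mean()

print('Linear EV:    %.3f' % ev_linear)
print('Quadratic EV: %.3f' % ev_quadratic)
```

On data like these, the linear model should explain almost none of the variance, while the quadratic one should get close to 1. The exact values depend on the noise level and on the random split.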