Preprocessing using pipelines

When taking measurements of real-world objects, we can often get features in very different ranges. For instance, if we are measuring the qualities of an animal, we might have several features, as follows:

  • Number of legs: This is in the range of 0 to 8 for most animals, while some have many more!
  • Weight: This ranges from only a few micrograms all the way up to a blue whale weighing 190,000 kilograms!
  • Number of hearts: This can be anywhere from zero to five, in the case of the earthworm.

For a mathematics-based algorithm to compare each of these features, the differences in scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature, purely because its values are the largest, and not because of anything to do with the actual effectiveness of the feature.

One of the methods to overcome this is to use a process called preprocessing to normalize the features so that they all have the same range, or are put into categories such as small, medium, and large. Suddenly, the large differences between the types of features have less of an impact on the algorithm, which can lead to large increases in accuracy.
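To make the problem concrete, here is a minimal sketch using made-up values for two hypothetical animals; it shows how a Euclidean distance between the feature vectors is dominated almost entirely by the weight:

import numpy as np

# Hypothetical feature vectors: [number of legs, weight in kilograms, number of hearts]
dog = np.array([4, 30.0, 1])
horse = np.array([4, 500.0, 1])

# The distance is about 470: essentially just the weight difference,
# while the legs and hearts features contribute nothing to the comparison.
print(np.sqrt(np.sum((dog - horse) ** 2)))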

Preprocessing can also be used to choose only the more effective features, create new features, and so on. Preprocessing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These don't have to operate on numerical data, as Transformers are also used to extract features; however, in this section, we will stick with preprocessing.

An example

We can show an example of the problem by breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form. First, we create a copy of the array so that we do not alter the original dataset:

X_broken = np.array(X)  # copy X so that our changes do not affect the original dataset
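(If you are not continuing from the notebook used earlier in this chapter, X and y first need to be loaded. Here is a minimal sketch, assuming the UCI Ionosphere data file, ionosphere.data, sits in the working directory; the filename and parsing details are assumptions rather than the chapter's exact code.)

import csv
import numpy as np

# The Ionosphere dataset: 34 numeric features per row, with a 'g'/'b' class label in the last column
X, y = [], []
with open("ionosphere.data") as input_file:
    for row in csv.reader(input_file):
        X.append([float(value) for value in row[:-1]])
        y.append(row[-1] == 'g')
X = np.array(X)
y = np.array(y)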

Next, we break the dataset by dividing every second feature by 10:

X_broken[:, ::2] /= 10  # divide every second column (indices 0, 2, 4, ...) by 10

In theory, this should not have a great effect on the result. After all, the values of these features still carry the same relative information. The major issue is that the scale has changed: the features we divided now span a much smaller range than the untouched ones. We can see the effect of this by computing the accuracy:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy is {0:.1f}%".format(np.mean(broken_scores) * 100))

This gives a score of 82.3 percent for the original dataset, which drops down to 71.5 percent on the broken dataset. We can fix this by scaling all the features to the range 0 to 1.

Standard preprocessing

The preprocessing we will perform for this experiment is called feature-based normalization, which we apply through the MinMaxScaler class. Continuing with the IPython notebook used in the rest of this chapter, we first import this class:

from sklearn.preprocessing import MinMaxScaler

This class takes each feature and scales it to the range 0 to 1. The minimum value is replaced with 0, the maximum with 1, and the other values are scaled to somewhere in between.
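As a minimal sketch of the underlying arithmetic (on a small made-up array), each value x in a feature is mapped to (x - min) / (max - min):

import numpy as np

feature = np.array([2.0, 5.0, 10.0])
scaled = (feature - feature.min()) / (feature.max() - feature.min())
print(scaled)  # [0.    0.375 1.   ]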

To apply our preprocessor, we first fit it to the data and then run the transform function on it. Transformers need to be fitted in the same way that classifiers are trained; for MinMaxScaler, fitting is where each feature's minimum and maximum are computed. We can combine the two steps by running the fit_transform function instead:

X_transformed = MinMaxScaler().fit_transform(X)

Here, X_transformed will have the same shape as X. However, each column will have a maximum of 1 and a minimum of 0.
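For reference, here is a minimal sketch of the two-step equivalent; a fitted scaler can also be reused to transform new data using the minimum and maximum learned from the original data:

scaler = MinMaxScaler()
scaler.fit(X)                         # learn each column's minimum and maximum
X_transformed = scaler.transform(X)   # apply the scaling

# The same fitted scaler could now transform other data, for example:
# X_new_transformed = scaler.transform(X_new)  # X_new is hypothetical here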

There are various other forms of normalization along these lines, each effective for different applications and feature types (a short usage sketch follows the list below):

  • Ensure the sum of the absolute values for each sample equals 1 (the norm='l1' option), using sklearn.preprocessing.Normalizer
  • Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
  • Turn numerical features into binary features, where any value above a threshold is 1 and any below is 0, using sklearn.preprocessing.Binarizer
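Here is a minimal sketch, on a tiny made-up array, of how these three classes are used; each follows the same fit_transform interface as MinMaxScaler:

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer

X_small = np.array([[1.0, 2.0, 5.0],
                    [3.0, 0.0, 1.0]])

# Each row is scaled so that its absolute values sum to 1 (the default norm is 'l2')
print(Normalizer(norm='l1').fit_transform(X_small))

# Each column is rescaled to have zero mean and unit variance
print(StandardScaler().fit_transform(X_small))

# Values above the threshold become 1; values at or below it become 0
print(Binarizer(threshold=2.5).fit_transform(X_small))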

We will use combinations of these preprocessors in later chapters, along with other types of Transformer objects.

Putting it all together

We can now create a workflow by combining the code from the previous sections, using the broken dataset previously calculated:

X_transformed = MinMaxScaler().fit_transform(X_broken)
estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(estimator, X_transformed, y, scoring='accuracy')
print("The average accuracy for is {0:.1f}%".format(np.mean(transformed_scores) * 100))

This gives us back our score of 82.3 percent accuracy. The MinMaxScaler resulted in features of the same scale, meaning that no features overpowered others simply by having larger values. While the Nearest Neighbor algorithm is easily confused by features on larger scales, some algorithms handle scale differences better. In contrast, some are much worse!
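As a minimal sketch of where the pipelines in this section's title come in (the step names 'scale' and 'predict' are our own choices, not fixed names), the scaler and classifier can be chained into a single estimator, so that the scaling is refitted inside each cross-validation fold rather than on the full dataset up front:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Chain the scaler and the classifier; cross_val_score fits the whole chain on each training fold
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])
scores = cross_val_score(pipeline, X_broken, y, scoring='accuracy')
print("The pipeline average accuracy is {0:.1f}%".format(np.mean(scores) * 100))

Because the scaler is fitted only on each training fold, the values in the corresponding test fold never influence the scaling parameters.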