- Machine Learning Algorithms (Second Edition)
- Giuseppe Bonaccorso
Managing missing features
Sometimes, a dataset can contain missing features, so there are a few options that can be taken into account:
- Removing the whole row
- Creating a submodel to predict those features
- Using an automatic strategy to impute them according to the other known values
The first option is the most drastic one and should only be considered when the dataset is quite large, the number of missing features is high, and any prediction could be risky. The second option is much more difficult because it's necessary to determine a supervised strategy to train a model for each feature and, finally, to predict their values. Considering all pros and cons, the third option is likely to be the best choice. scikit-learn offers the SimpleImputer class (found in the sklearn.impute module; in versions prior to 0.22 it was called Imputer and lived in sklearn.preprocessing), which is responsible for filling the holes using a strategy based on the mean (the default choice), the median, or the most frequent value (the most frequent entry of each column will be used for all the missing ones).
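The first two options can be sketched quickly before moving on. The following is a minimal sketch, assuming NumPy arrays and a recent scikit-learn version that ships IterativeImputer (an experimental class that fits a regression submodel per feature, which is one possible way to implement the second option):
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Option 1: drop every row containing at least one missing value
complete_rows = data[~np.isnan(data).any(axis=1)]
# array([[-1., 4., 2.]])

# Option 2: train a regression submodel per feature and predict the missing entries
model_imp = IterativeImputer(max_iter=10, random_state=0)
filled = model_imp.fit_transform(data)
On such a tiny dataset, the row-removal mask leaves only one complete sample, which shows how drastic the first option can be; IterativeImputer, on the other hand, models each incomplete feature as a function of the others before predicting the missing values.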
The following snippet shows an example using the three strategies. By default, a missing feature entry is marked with NaN; however, it's possible to use a different placeholder through the missing_values parameter:
import numpy as np
from sklearn.impute import SimpleImputer  # called Imputer in scikit-learn < 0.22

data = np.array([[1, np.nan, 2], [2, 3, np.nan], [-1, 4, 2]])

# Mean of the known values in each column
imp = SimpleImputer(strategy='mean')
imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

# Median of the known values in each column
imp = SimpleImputer(strategy='median')
imp.fit_transform(data)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])

# Most frequent value in each column
imp = SimpleImputer(strategy='most_frequent')
imp.fit_transform(data)
array([[ 1.,  3.,  2.],
       [ 2.,  3.,  2.],
       [-1.,  4.,  2.]])
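If the missing entries are encoded with a placeholder other than NaN, the same procedure applies once the placeholder is declared explicitly. The following is a minimal sketch assuming a hypothetical sentinel value of -999:
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical dataset where -999 marks a missing entry
data_sentinel = np.array([[1.0, -999.0, 2.0], [2.0, 3.0, -999.0], [-1.0, 4.0, 2.0]])

imp = SimpleImputer(missing_values=-999.0, strategy='mean')
imp.fit_transform(data_sentinel)
array([[ 1. ,  3.5,  2. ],
       [ 2. ,  3. ,  2. ],
       [-1. ,  4. ,  2. ]])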