Feature Selection and Feature Engineering

Feature engineering is the first step in a machine learning pipeline and covers all the techniques used to clean existing datasets, increase their signal-to-noise ratio, and reduce their dimensionality. Most algorithms make strong assumptions about the input data, and their performance can be negatively affected when raw datasets are used. Moreover, data is seldom isotropic: some features determine the general behavior of a sample, while others that are correlated with them provide no additional information. It's therefore important to have a clear view of a dataset and to know the most common algorithms for reducing the number of features or selecting only the best ones.

In particular, we are going to discuss the following topics:

  • How to work with scikit-learn built-in datasets and split them into training and test sets
  • How to manage missing and categorical features
  • How to filter and select the features according to different criteria
  • How to normalize, scale, and whiten a dataset
  • How to reduce the dimensionality of a dataset using Principal Component Analysis (PCA)
  • How to perform PCA on non-linear datasets
  • How to extract independent components and create dictionaries of atoms
  • How to visualize high-dimensional datasets using the t-SNE algorithm
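As a preview of several of these steps, the following is a minimal sketch (illustrative only, not the chapter's exact code) that loads a scikit-learn built-in dataset, splits it into training and test sets, standardizes the features, and reduces the dimensionality with PCA. The dataset choice (`load_digits`) and the number of components are arbitrary assumptions for the example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a built-in dataset: 1797 samples, 64 features (8x8 pixel digits)
X, y = load_digits(return_X_y=True)

# Hold out 25% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1000)

# Standardize the features to zero mean and unit variance,
# fitting the scaler only on the training set to avoid leakage
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Project the standardized data onto the first 16 principal components
pca = PCA(n_components=16)
X_train_p = pca.fit_transform(X_train_s)
X_test_p = pca.transform(X_test_s)

print(X_train_p.shape, X_test_p.shape)
```

Each of these operations (splitting, scaling, dimensionality reduction) is covered in detail in the sections that follow.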