Training and validation sets

In real problems, the number of samples is limited, and it's usually necessary to split the initial set X (together with Y) into two subsets as follows:

  • Training set used to train the model
  • Validation set used to assess the score of the model without any bias, with samples never seen before

According to the nature of the problem, it's possible to choose a split percentage ratio of 70% – 30% (a good practice in machine learning, where the datasets are relatively small), or a higher training percentage (80%, 90%, up to 99%) for deep learning tasks where the number of samples is very high. In both cases, we are assuming that the training set contains all the information required for a consistent generalization. In many simple cases, this is true and can be easily verified; but with more complex datasets, the problem becomes harder. Even if we think to draw all the samples from the same distribution, it can happen that a randomly selected test set contains features that are not present in other training samples. Such a condition can have a very negative impact on global accuracy and, without other methods, it can also be very difficult to identify. This is one of the reasons why, in deep learning, training sets are huge: considering the complexity of the features and structure of the data generating distributions, choosing large test sets can limit the possibility of learning particular associations.

In Scikit-Learn, it's possible to split the original dataset using the train_test_split() function, which allows specifying the train/test size, and if we expect to have randomly shuffled sets (default). For example, if we want to split X and Y, with 70% training and 30% test, we can use:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7, random_state=1)

Shuffling the sets is always a good practice, in order to reduce the correlation between samples. In fact, we have assumed that X is made up of i.i.d samples, but several times two subsequent samples have a strong correlation, reducing the training performance. In some cases, it's also useful to re-shuffle the training set after each training epoch; however, in the majority of our examples, we are going to work with the same shuffled dataset throughout the whole process. Shuffling has to be avoided when working with sequences and models with memory: in all those cases, we need to exploit the existing correlation to determine how the future samples are distributed. 

When working with NumPy and Scikit-Learn, it's always a good practice to set the random seed to a constant value, so as to allow other people to reproduce the experiment with the same initial conditions. This can be achieved by calling np.random.seed(...) and using the random-state parameter present in many Scikit-Learn methods.