Creating training and test sets

When a dataset is large enough, it's good practice to split it into training and test sets: the former is used to train the model and the latter to evaluate its performance. The following diagram shows a schematic representation of this process:

Training/test set split process schema

There are two main rules in performing such an operation:

Both datasets must reflect the original distribution
The original dataset must be randomly shuffled before the split phase in order to avoid a correlation between consecutive elements
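As a minimal sketch of these two rules, the following snippet builds a small imbalanced toy dataset (the data, the 90/10 class ratio, and the variable names are assumptions made only for illustration) and checks that both subsets keep roughly the original class proportion. It relies on the shuffle and stratify parameters of train_test_split():

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 90 samples of class 0 followed by 10 of class 1.
# The samples are sorted by class, so an unshuffled split would be biased.
X = np.arange(100).reshape(-1, 1)
Y = np.array([0] * 90 + [1] * 10)

# shuffle=True is the default; stratify=Y asks scikit-learn to keep
# the class proportions (approximately) equal in both subsets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, shuffle=True, stratify=Y, random_state=1000)

# Fraction of class 1 in each subset (the original fraction is 0.1)
train_ratio = Y_train.mean()
test_ratio = Y_test.mean()
print(train_ratio, test_ratio)
```

Stratification is only one possible way to preserve the original distribution; with large, well-shuffled datasets, a plain random split is often sufficient.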

With scikit-learn, this can be achieved by using the train_test_split() function:

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1000)

The test_size parameter (as well as train_size) allows you to specify the fraction of elements to put into the test/training set, either as a float between 0 and 1 or as an absolute number of samples. In this case, the ratio is 75 percent for training and 25 percent for the test phase. In classic machine learning tasks, this is a common ratio; however, in deep learning, it can be useful to extend the training set to 98% of the total data. The right split percentage depends on the specific scenario. In general, the rule of thumb is that the training set (as well as the test set) must represent the whole data generating process. Sometimes, it's necessary to rebalance the classes, as shown in the previous chapter, Chapter 2, Important Elements in Machine Learning, but it's extremely important not to exclude potential samples from the training phase; otherwise, the model won't ever be able to generalize correctly.
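To make the effect of the two parameters concrete, here is a short sketch that applies both ratios mentioned above to a synthetic dataset. The dataset size (10,000 samples, 5 features) and the variable names are assumptions chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 10,000 samples and 5 features
rng = np.random.RandomState(1000)
X = rng.normal(size=(10000, 5))
Y = rng.randint(0, 2, size=10000)

# Classic 75/25 split, specified through test_size
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1000)
print(X_train.shape, X_test.shape)

# 98/2 split, specified through train_size
# (the test set receives the remaining samples)
X_train_l, X_test_l, Y_train_l, Y_test_l = train_test_split(
    X, Y, train_size=0.98, random_state=1000)
print(X_train_l.shape, X_test_l.shape)
```

When only one of the two parameters is given, scikit-learn assigns the complement to the other subset, so there is normally no need to set both.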

Another important parameter is random_state, which can accept a NumPy RandomState generator or an integer seed. In many cases, it's important to make experiments reproducible, so it's also necessary to avoid using different seeds and, consequently, different random splits.

In machine learning projects, I always suggest using the same random seed (its value is arbitrary, so it can also be 0, but it must be set explicitly because omitting it yields a different split on every run) or defining a global RandomState, which can be passed to all the functions that require it. In this way, reproducibility is guaranteed:

from sklearn.utils import check_random_state

rs = check_random_state(1000)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=rs)

In this way, if the seed is kept equal, all experiments will lead to the same results and can be easily reproduced in different environments by other scientists.
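The following sketch illustrates what reproducibility means in practice, using a small toy dataset that is an assumption made for this example. It also shows one detail worth keeping in mind: a shared RandomState advances its internal state at every call, so consecutive splits drawn from it differ from each other, even though the whole sequence is reproduced whenever the script is rerun with the same seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset
X = np.arange(200).reshape(100, 2)
Y = np.arange(100) % 2

# Two independent calls with the same integer seed yield identical splits
split_a = train_test_split(X, Y, test_size=0.25, random_state=1000)
split_b = train_test_split(X, Y, test_size=0.25, random_state=1000)
same_seed_equal = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# A shared RandomState is consumed by each call: the first split matches
# the integer-seeded one, while the second split is a different permutation
rs = np.random.RandomState(1000)
split_c = train_test_split(X, Y, test_size=0.25, random_state=rs)
split_d = train_test_split(X, Y, test_size=0.25, random_state=rs)
first_call_matches = np.array_equal(split_a[1], split_c[1])
shared_state_equal = np.array_equal(split_c[1], split_d[1])

print(same_seed_equal, first_call_matches, shared_state_equal)
```

If the very same split is needed at several points of a project, passing the integer seed directly is therefore the simpler option.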

For further information about NumPy random number generation, visit https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html.