Organizing your data

Just like any other network, a neural network depends on data. Previously, we used datasets containing 1,000 to 100,000 rows of data. Even in cases where more data was added, the low computational power of the systems would not allow us to organize this kind of data efficiently.

We always begin with training our network, which implies that we in fact need a training dataset that should consist of 60% of the total data in the dataset. This is a very important step, as here is where the neural network learns the values of the weights present in the dataset. The second phase is to see how well the network does with data that it has never seen before, which consists of 20% of the data in the dataset. This dataset is known as a cross-validation dataset. The aim of this phase is to see how the the network generalizes data that it was not trained for.

Based on the performance in this phase, we can vary the parameters to attain the best possible output. This phase will continue until we have achieved optimum performance. The remaining data in the dataset can now be used as the test dataset. The reason for having this dataset is to have a completely unbiased evaluation of the network. This is basically to understand the behavior of the network toward data that it has not seen before and not been optimized for.

If we were to have a visual representation of the organization of data as described previously, here is what it would look like as in the following diagram: 

This configuration was well known and widely used until recent years. Multiple variations to the percentages of data allocated to each dataset also existed. In the more recent era of deep learning, two things have changed substantially: 

  • Millions of rows of data are present in the datasets that we are currently using. 
  • The computational power of our processing systems has increased drastically because of advanced GPUs.

Due to these reasons, neural networks now have a deeper, bigger, and more complex architecture. Here is how our data will be organized: 

The training dataset has increased immensely; observe that 96% of the data will be used as a dataset, while 2% will be required for the development dataset and the remaining 2% for the testing dataset. In fact, it is even possible to have 99% of the data used to train the model and the remaining 1% of data to be divided between the training and the development datasets. At some points, it's OK to not have a test dataset at all. The only time we need to have a test dataset is when we need to have a completely unbiased evaluation. Through the course of this chapter, we shall hardly use the test dataset.

Notice how the cross-validation dataset becomes the development dataset. The functionality of the dataset does not vary.