- Machine Learning Algorithms(Second Edition)
- Giuseppe Bonaccorso
Error measures and cost functions
In general, when working with a supervised scenario, we define a non-negative error measure $e_m$ that takes two arguments (the expected output $y_i$ and the predicted output $\hat{y}_i$) and allows us to compute a total error value over the whole dataset (made up of N samples):

$$E = \sum_{i=1}^{N} e_m(\hat{y}_i, y_i)$$
This value is also implicitly dependent on the specific hypothesis H through its parameter set, so optimizing the error implies finding an optimal hypothesis (given the hardness of many optimization problems, this is not the absolute best one, but an acceptable approximation). In many cases, it's useful to consider the mean squared error (MSE):

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$
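As a concrete illustration, here is a minimal sketch (not taken from the book) that computes both the total squared error and the MSE for some made-up expected outputs and predictions using NumPy:

```python
import numpy as np

# Hypothetical expected outputs y and predictions y_hat produced by
# some hypothesis H (the values are made up for illustration)
y = np.array([1.0, 0.5, 2.2, 3.1, 0.9])
y_hat = np.array([1.1, 0.4, 2.0, 3.3, 1.0])

# Total error with the squared error as the per-sample measure e_m
total_error = np.sum((y_hat - y) ** 2)

# Mean squared error over the N samples
mse = np.mean((y_hat - y) ** 2)

print(total_error, mse)
```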
Its initial value represents a starting point on the surface of an n-variable function. A generic training algorithm has to find the global minimum, or a point quite close to it (there's always a tolerance, to avoid an excessive number of iterations and the consequent risk of overfitting). This measure is also called a loss function (or cost function) because its value must be minimized through an optimization problem. When it's easier to work with a quantity that must be maximized, the corresponding loss function is obtained by taking its negative or, for strictly positive quantities, its reciprocal. A concrete sketch of such a loss surface follows.
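To make the idea of a loss surface concrete, the following sketch (again with made-up data and a hypothetical one-parameter linear model $\hat{y} = w x$) evaluates the MSE over a grid of parameter values and picks the minimizer by exhaustive search; real training algorithms explore the surface far more efficiently:

```python
import numpy as np

# Hypothetical one-parameter model y_hat = w * x, with made-up data
# generated around w = 2.0 plus a little noise
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = 2.0 * x + np.array([0.1, -0.2, 0.05, 0.0, -0.1])

# MSE as a function of the parameter w: the "surface" to be explored
w_grid = np.linspace(0.0, 4.0, 401)
mse = np.array([np.mean((y - w * x) ** 2) for w in w_grid])

# A training algorithm has to find the (global) minimum of this surface;
# here a crude exhaustive search stands in for it
w_best = w_grid[np.argmin(mse)]
print(w_best)  # close to 2.0
```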
Another useful loss function is the zero-one loss, which is particularly well suited to binary classification (and also to the one-versus-rest multiclass strategy):

$$L_{0/1}(\hat{y}_i, y_i) = \begin{cases} 0 & \text{if } \hat{y}_i = y_i \\ 1 & \text{if } \hat{y}_i \neq y_i \end{cases}$$
This function is implicitly an indicator (it equals 1 only on misclassified samples) and can easily be incorporated into loss functions based on the probability of misclassification.
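A quick sketch (with hypothetical labels and predictions) shows how the zero-one loss acts as an indicator and how its average yields the empirical misclassification rate:

```python
import numpy as np

# Hypothetical binary labels and predictions (made up for illustration)
y = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Per-sample zero-one loss: 1 for a misclassification, 0 otherwise
zero_one = (y_hat != y).astype(int)

# Averaging the indicator gives the empirical misclassification rate
misclassification_rate = zero_one.mean()

print(zero_one, misclassification_rate)
```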
A helpful interpretation of a generic (and continuous) loss function can be expressed in terms of potential energy:
The predictor is like a ball on a rough surface: starting from a random point where the energy (that is, the error) is usually rather high, it must move until it reaches a stable equilibrium point where its energy (relative to the global minimum) is null. In the following diagram, there's a schematic representation of some different situations:
Just like in the physical situation, the starting point is stable in the absence of any external perturbation, so to start the process it's necessary to provide some initial kinetic energy. However, if this energy is too strong, then after descending the slope, the ball cannot stop at the global minimum: the residual kinetic energy can be enough to overcome the ridge and reach the right-hand valley (a plateau). If there are no other energy sources, the ball gets trapped in that flat valley and cannot move anymore. Many techniques have been engineered to solve this problem and avoid local minima. Another common problem (in particular, when working with deep models) is represented by saddle points, which are characterized by a null gradient and an indefinite Hessian matrix (one with both positive and negative eigenvalues). This means that the point is neither a local minimum nor a maximum: it behaves like a minimum along some directions and like a maximum along others. In the following diagram, we can see a three-dimensional example of this:
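The following toy sketch (not from the book) mimics the ball analogy with plain gradient descent on a made-up one-dimensional surface that has both a global and a local minimum; depending on the starting point, the "ball" settles in one or the other:

```python
import numpy as np

def f(x):
    # A toy 1-D "energy" surface with a global and a local minimum
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    # Analytical derivative of f
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, n_steps=1000):
    # Plain gradient descent: the "ball" rolls downhill from x0
    x = x0
    for _ in range(n_steps):
        x -= lr * grad_f(x)
    return x

# Starting on the left slope, the ball reaches the global minimum;
# starting on the right slope, it gets trapped in the local one.
print(gradient_descent(-2.0))  # ~ -1.30 (global minimum)
print(gradient_descent(2.0))   # ~  1.13 (local minimum)
```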
Every situation must always be carefully analyzed to understand what level of residual energy (or error) is acceptable, or whether it's better to adopt a different strategy. In particular, all the models that are explicitly based on a loss/cost function can normally be trained using different optimization algorithms, which are able to overcome the problems that a simpler approach cannot handle. We're going to discuss some of them in the following chapters.