Elements of information theory

A machine learning problem can also be analyzed in terms of information transfer or exchange. Suppose our dataset is composed of n features, which we consider independent (a simplifying assumption that does not always hold in practice) and drawn from n different statistical distributions. There are therefore n probability density functions p_i(x), each of which must be approximated by a corresponding function q_i(x). In any machine learning task, it's very important to understand how much each pair of corresponding distributions diverges and how much information is lost when approximating the original dataset.
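To make the idea of divergence between a true distribution and its approximation more concrete, here is a minimal sketch. It assumes one common measure, the Kullback-Leibler divergence, which the text above does not name explicitly, and it works on discrete distributions rather than densities. The function name `kl_divergence` and the example values of `p` and `q` are purely illustrative.

```python
import math


def kl_divergence(p, q):
    """Return D(p || q) in bits for two discrete distributions.

    Assumes p and q have the same length, each sums to 1, and
    q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    # Terms with p[i] == 0 contribute nothing, so they are skipped.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical "true" distribution of one feature over three values
p = [0.5, 0.25, 0.25]
# Hypothetical approximation: a uniform distribution
q = [1 / 3, 1 / 3, 1 / 3]

d = kl_divergence(p, q)
print(f"D(p || q) = {d:.4f} bits")
```

Under the independence assumption, the same computation can be repeated for each of the n features. A value of zero means the approximation matches the original distribution exactly; larger values indicate that more information is lost.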