- Mastering Machine Learning Algorithms
- Giuseppe Bonaccorso
- 380字
- 2021-06-25 22:07:35
Label propagation based on Markov random walks
The goal of this algorithm proposed by Zhu and Ghahramani is to find the probability distribution of target labels for unlabeled samples given a mixed dataset. This objective is achieved through the simulation of a stochastic process, where each unlabeled sample walks through the graph until it reaches a stationary absorbing state, a labeled sample where it stops acquiring the corresponding label. The main difference with other similar approaches is that in this case, we consider the probability of reaching a labeled sample. In this way, the problem acquires a closed form and can be easily solved.
The first step is to always build a k-nearest neighbors graph with all N samples, and define a weight matrix W based on an RBF kernel:
Wij = 0 is xi, and xj are not neighbors and Wii = 1. The transition probability matrix, similarly to the Scikit-Learn label propagation algorithm, is built as:
In a more compact way, it can be rewritten as P = D-1W. If we now consider a test sample, starting from the state xi and randomly walking until an absorbing labeled state is found (we call this label y∞), the probability (referred to as binary classification) can be expressed as:
When xi is labeled, the state is final, and it is represented by the indicator function based on the condition yi=1. When the sample is unlabeled, we need to consider the sum of all possible transitions starting from xi and ending in the closest absorbing state, with label y=1 weighted by the relative transition probabilities.
We can rewrite this expression in matrix form. If we create a vector P∞ = [ PL(y∞=1|XL), PU(y∞=1|XU) ], where the first component is based on labeled samples and the second on the unlabeled ones, we can write:
If we now expand the matrices, we get:
As we are interested only in the unlabeled samples, we can consider only the second equation:
Simplifying the expression, we get the following linear system:
The term (Duu - Wuu) is the unlabeled-unlabeled part of the unnormalized graph Laplacian L = D - W. By solving this system, we can get the probabilities for the class y=1 for all unlabeled samples.