- Mastering Machine Learning Algorithms
- Giuseppe Bonaccorso
Example of contrastive pessimistic likelihood estimation
We are going to implement the CPLE algorithm in Python using a subset extracted from the MNIST dataset. For simplicity, we are going to use only the samples representing the digits 0 and 1:
from sklearn.datasets import load_digits
import numpy as np
X_a, Y_a = load_digits(return_X_y=True)
X = np.vstack((X_a[Y_a == 0], X_a[Y_a == 1]))
Y = np.vstack((np.expand_dims(Y_a, axis=1)[Y_a==0], np.expand_dims(Y_a, axis=1)[Y_a==1]))
nb_samples = X.shape[0]
nb_dimensions = X.shape[1]
nb_unlabeled = 150
unlabeled_idx = np.random.choice(np.arange(0, nb_samples, 1), replace=False, size=nb_unlabeled)
Y_true = Y[unlabeled_idx].copy()
Y[unlabeled_idx] = -1
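Before moving on, a quick sanity check (not part of the original listing) confirms the size of the restricted dataset and the number of removed labels; if a reproducible selection is needed, a call such as np.random.seed(1000), placed before np.random.choice(), fixes it (the seed value is arbitrary):
print(X.shape, Y.shape)        # expected: (360, 64) (360, 1)
print(int(np.sum(Y == -1)))    # expected: 150 unlabeled samples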
After creating the restricted dataset (X, Y), which contains 360 samples, we randomly select 150 samples (about 42%) to become unlabeled (the corresponding y is set to -1). At this point, we can measure the performance of a logistic regression trained only on the labeled samples:
from sklearn.linear_model import LogisticRegression
lr_test = LogisticRegression()
lr_test.fit(X[Y.squeeze() != -1], Y[Y.squeeze() != -1].squeeze())
unlabeled_score = lr_test.score(X[Y.squeeze() == -1], Y_true)
print(unlabeled_score)
0.573333333333
So, logistic regression achieves about 57% accuracy when classifying the unlabeled samples. We can also evaluate the cross-validation score on the whole dataset (before removing some random labels):
from sklearn.model_selection import cross_val_score
total_cv_scores = cross_val_score(LogisticRegression(), X, Y.squeeze(), cv=10)
print(total_cv_scores)
[ 0.48648649 0.51351351 0.5 0.38888889 0.52777778 0.36111111 0.58333333 0.47222222 0.54285714 0.45714286]
Thus, the classifier achieves an average accuracy of about 48% when using 10 folds (each test set contains 36 samples), even when all the labels are known.
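The quoted average is simply the mean of the ten fold scores and can be printed directly:
print(total_cv_scores.mean())   # approximately 0.48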
We can now implement the CPLE algorithm. The first step is to initialize a LogisticRegression instance and the soft-labels:
lr = LogisticRegression()
q0 = np.random.uniform(0, 1, size=nb_unlabeled)
q0 is a random array of values in the half-open interval [0, 1); therefore, we also need a converter to transform each qi into an actual binary label. We can achieve this using the NumPy function np.vectorize(), which allows us to apply a transformation to all the elements of a vector:
trh = np.vectorize(lambda x: 0.0 if x < 0.5 else 1.0)
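For instance, applying the vectorized threshold to a few arbitrary values behaves as expected:
print(trh(np.array([0.2, 0.5, 0.9])))   # output: [ 0.  1.  1.]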
In order to compute the log-likelihood, we also need a weighted log-loss (similar to the scikit-learn function log_loss(), which, however, computes the negative log-likelihood and doesn't support the per-sample class weights we need here):
def weighted_log_loss(yt, p, w=None, eps=1e-15):
    # Per-sample weights for the two classes: w for the first column, 1 - w for the second
    if w is None:
        w_t = np.ones((yt.shape[0], 2))
    else:
        w_t = np.vstack((w, 1.0 - w)).T
    # One-hot representation of the binary labels: (1 - y, y)
    Y_t = np.vstack((1.0 - yt.squeeze(), yt.squeeze())).T
    # Weighted log-likelihood, with the probabilities clipped for numerical stability
    L_t = np.sum(w_t * Y_t * np.log(np.clip(p, eps, 1.0 - eps)), axis=1)
    return np.mean(L_t)
This function computes the following expression (with the probabilities clipped to [eps, 1 − eps] for numerical stability):
L(Y, P; w) = (1/N) Σᵢ [wᵢ (1 − yᵢ) log p(0|xᵢ) + (1 − wᵢ) yᵢ log p(1|xᵢ)]
When w is not provided, all the weights are equal to 1 and the expression reduces to the average log-likelihood.
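As a quick check, the function can be invoked with a small batch of toy labels, predicted probabilities, and weights (the values below are arbitrary):
yt_test = np.array([[1.0], [0.0]])               # two toy binary labels
p_test = np.array([[0.2, 0.8], [0.7, 0.3]])      # predicted probabilities per class
w_test = np.array([0.9, 0.4])                    # per-sample soft weights
print(weighted_log_loss(yt_test, p_test, w_test))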
We also need a function to build the dataset with the variable soft-labels qi:
def build_dataset(q):
    # Convert the soft-labels into binary labels
    Y_unlabeled = trh(q)
    # Stack the labeled samples first, followed by the (re-labeled) unlabeled ones
    X_n = np.zeros((nb_samples, nb_dimensions))
    X_n[0:nb_samples - nb_unlabeled] = X[Y.squeeze() != -1]
    X_n[nb_samples - nb_unlabeled:] = X[Y.squeeze() == -1]
    Y_n = np.zeros((nb_samples, 1))
    Y_n[0:nb_samples - nb_unlabeled] = Y[Y.squeeze() != -1]
    Y_n[nb_samples - nb_unlabeled:] = np.expand_dims(Y_unlabeled, axis=1)
    return X_n, Y_n
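A quick check (not in the original listing) shows that the rebuilt dataset preserves the overall shape and that no -1 labels remain:
X_t, Y_t = build_dataset(q0)
print(X_t.shape, Y_t.shape)   # expected: (360, 64) (360, 1)
print(np.unique(Y_t))         # expected: [ 0.  1.]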
At this point, we can define our contrastive log-likelihood:
def log_likelihood(q):
    # Rebuild the dataset using the current soft-labels
    X_n, Y_n = build_dataset(q)
    Y_soft = trh(q)
    # Implicit maximization with respect to the model parameters
    lr.fit(X_n, Y_n.squeeze())
    p_sup = lr.predict_proba(X[Y.squeeze() != -1])
    p_semi = lr.predict_proba(X[Y.squeeze() == -1])
    l_sup = weighted_log_loss(Y[Y.squeeze() != -1], p_sup)
    l_semi = weighted_log_loss(Y_soft, p_semi, q)
    # Contrastive log-likelihood, minimized with respect to q
    return l_semi - l_sup
This method will be called by the optimizer, passing a different q vector each time. The first step is building the new dataset and computing Y_soft, the binary labels corresponding to q. Then the logistic regression classifier is trained on this dataset (as Y_n is a (k, 1) array, it's necessary to squeeze it to avoid a warning; the same is done when using Y as a boolean indicator). At this point, it's possible to compute both p_sup and p_semi using the predict_proba() method and, finally, the semi-supervised and supervised log-losses. Their difference is the term, as a function of qi, that we want to minimize, while the maximization with respect to θ is performed implicitly when training the logistic regression.
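Before starting the optimization, it can be instructive to evaluate the objective at the random starting point q0 (the actual value changes at every run, because q0 is random):
print(log_likelihood(q0))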
The optimization is carried out using the BFGS algorithm implemented in SciPy:
from scipy.optimize import fmin_bfgs
q_end = fmin_bfgs(f=log_likelihood, x0=q0, maxiter=5000, disp=False)
This is a very fast algorithm, but the reader is encouraged to experiment with other methods or libraries (a constrained alternative is sketched below). The two parameters we need in this case are f, which is the function to minimize, and x0, which is the initial condition for the independent variables. maxiter is useful for avoiding an excessive number of iterations when no improvements are achieved. Once the optimization is complete, q_end contains the optimal soft-labels. We can, therefore, rebuild our dataset:
X_n, Y_n = build_dataset(q_end)
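As an example of such experimentation, and since the soft-labels are probabilities, scipy.optimize.minimize() with the L-BFGS-B method can enforce box constraints that keep every qi inside [0, 1]; this is only a sketch, and the results reported below refer to the unconstrained q_end obtained above:
from scipy.optimize import minimize

res = minimize(fun=log_likelihood,
               x0=q0,
               method='L-BFGS-B',
               bounds=[(0.0, 1.0)] * nb_unlabeled,
               options={'maxiter': 5000})
q_end_b = res.x   # constrained soft-labels (alternative to q_end)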
With this final configuration, we can retrain the logistic regression and check the cross-validation accuracy:
final_semi_cv_scores = cross_val_score(LogisticRegression(), X_n, Y_n.squeeze(), cv=10)
print(final_semi_cv_scores)
[ 1. 1. 0.89189189 0.77777778 0.97222222 0.88888889 0.61111111 0.88571429 0.94285714 0.48571429]
The semi-supervised solution based on the CPLE algorithm achieves an average accuracy of about 84%, outperforming, as expected, the purely supervised approach. The reader can try other examples using different classifiers, such as SVMs or decision trees, and verify when CPLE allows obtaining a higher accuracy than the corresponding supervised algorithm.
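As a starting point for such experiments, the global classifier instance can be replaced, for example, with an SVM with probability estimates enabled (probability=True is required because log_likelihood() calls predict_proba()); this is only a sketch under the assumption that the rest of the code is left unchanged, and training an SVM at every function evaluation can be noticeably slower:
from sklearn.svm import SVC

# Hypothetical variant: reuse the global name lr so that log_likelihood() works unchanged
lr = SVC(probability=True)
q0_svm = np.random.uniform(0, 1, size=nb_unlabeled)
q_end_svm = fmin_bfgs(f=log_likelihood, x0=q0_svm, maxiter=5000, disp=False)
X_s, Y_s = build_dataset(q_end_svm)
svm_cv_scores = cross_val_score(SVC(), X_s, Y_s.squeeze(), cv=10)
print(svm_cv_scores.mean())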