- Mastering Machine Learning Algorithms
- Giuseppe Bonaccorso
Example of contrastive pessimistic likelihood estimation
We are going to implement the CPLE algorithm in Python using a subset extracted from the MNIST dataset. For simplicity, we are going to use only the samples representing the digits 0 and 1:
from sklearn.datasets import load_digits
import numpy as np
X_a, Y_a = load_digits(return_X_y=True)
X = np.vstack((X_a[Y_a == 0], X_a[Y_a == 1]))
Y = np.vstack((np.expand_dims(Y_a, axis=1)[Y_a==0], np.expand_dims(Y_a, axis=1)[Y_a==1]))
nb_samples = X.shape[0]
nb_dimensions = X.shape[1]
nb_unlabeled = 150
unlabeled_idx = np.random.choice(np.arange(0, nb_samples, 1), replace=False, size=nb_unlabeled)
Y_true = Y[unlabeled_idx].copy()
Y[unlabeled_idx] = -1
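Before moving on, a quick sanity check (not part of the original listing) confirms the size of the restricted dataset and the number of removed labels; if a reproducible selection is needed, a call such as np.random.seed(1000), placed before np.random.choice(), fixes it (the seed value is arbitrary):
print(X.shape, Y.shape)        # expected: (360, 64) (360, 1)
print(int(np.sum(Y == -1)))    # expected: 150 unlabeled samples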
After creating the restricted dataset (X, Y), which contains 360 samples, we randomly select 150 samples (about 42%) to become unlabeled (the corresponding y is set to -1). At this point, we can measure the performance of a logistic regression trained only on the labeled samples:
from sklearn.linear_model import LogisticRegression
lr_test = LogisticRegression()
lr_test.fit(X[Y.squeeze() != -1], Y[Y.squeeze() != -1].squeeze())
unlabeled_score = lr_test.score(X[Y.squeeze() == -1], Y_true)
print(unlabeled_score)
0.573333333333
So, logistic regression achieves about 57% accuracy when classifying the unlabeled samples. We can also evaluate the cross-validation score on the whole dataset (before removing some random labels):
from sklearn.model_selection import cross_val_score
total_cv_scores = cross_val_score(LogisticRegression(), X, Y.squeeze(), cv=10)
print(total_cv_scores)
[ 0.48648649 0.51351351 0.5 0.38888889 0.52777778 0.36111111 0.58333333 0.47222222 0.54285714 0.45714286]
Thus, the classifier achieves an average accuracy of about 48% when using 10 folds (each test set contains 36 samples), even when all the labels are known.
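The quoted average is simply the mean of the ten fold scores and can be printed directly:
print(total_cv_scores.mean())   # approximately 0.48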
We can now implement the CPLE algorithm. The first step is to initialize a LogisticRegression instance and the soft-labels:
lr = LogisticRegression()
q0 = np.random.uniform(0, 1, size=nb_unlabeled)
q0 is a random array of values in the half-open interval [0, 1); therefore, we also need a converter to transform each qi into an actual binary label. We can achieve this using the NumPy function np.vectorize(), which allows us to apply a transformation to all the elements of a vector:
trh = np.vectorize(lambda x: 0.0 if x < 0.5 else 1.0)
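For instance, applying the vectorized threshold to a few arbitrary values behaves as expected:
print(trh(np.array([0.2, 0.5, 0.9])))   # output: [ 0.  1.  1.]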
In order to compute the log-likelihood, we also need a weighted log-loss (similar to the scikit-learn function log_loss(), which, however, computes the negative log-likelihood and doesn't support the per-sample class weights we need here):
def weighted_log_loss(yt, p, w=None, eps=1e-15):
    # Per-sample weights for the two classes: w for the first column, 1 - w for the second
    if w is None:
        w_t = np.ones((yt.shape[0], 2))
    else:
        w_t = np.vstack((w, 1.0 - w)).T
    # One-hot representation of the binary labels: (1 - y, y)
    Y_t = np.vstack((1.0 - yt.squeeze(), yt.squeeze())).T
    # Weighted log-likelihood, with the probabilities clipped for numerical stability
    L_t = np.sum(w_t * Y_t * np.log(np.clip(p, eps, 1.0 - eps)), axis=1)
    return np.mean(L_t)
This function computes the following expression (with the probabilities clipped to [eps, 1 − eps] for numerical stability):
L(Y, P; w) = (1/N) Σᵢ [wᵢ (1 − yᵢ) log p(0|xᵢ) + (1 − wᵢ) yᵢ log p(1|xᵢ)]
When w is not provided, all the weights are equal to 1 and the expression reduces to the average log-likelihood.
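As a quick check, the function can be invoked with a small batch of toy labels, predicted probabilities, and weights (the values below are arbitrary):
yt_test = np.array([[1.0], [0.0]])               # two toy binary labels
p_test = np.array([[0.2, 0.8], [0.7, 0.3]])      # predicted probabilities per class
w_test = np.array([0.9, 0.4])                    # per-sample soft weights
print(weighted_log_loss(yt_test, p_test, w_test))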
We also need a function to build the dataset with the variable soft-labels qi:
def build_dataset(q):
    # Convert the soft-labels into binary labels
    Y_unlabeled = trh(q)
    # Stack the labeled samples first, followed by the (re-labeled) unlabeled ones
    X_n = np.zeros((nb_samples, nb_dimensions))
    X_n[0:nb_samples - nb_unlabeled] = X[Y.squeeze() != -1]
    X_n[nb_samples - nb_unlabeled:] = X[Y.squeeze() == -1]
    Y_n = np.zeros((nb_samples, 1))
    Y_n[0:nb_samples - nb_unlabeled] = Y[Y.squeeze() != -1]
    Y_n[nb_samples - nb_unlabeled:] = np.expand_dims(Y_unlabeled, axis=1)
    return X_n, Y_n
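A quick check (not in the original listing) shows that the rebuilt dataset preserves the overall shape and that no -1 labels remain:
X_t, Y_t = build_dataset(q0)
print(X_t.shape, Y_t.shape)   # expected: (360, 64) (360, 1)
print(np.unique(Y_t))         # expected: [ 0.  1.]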
At this point, we can define our contrastive log-likelihood:
def log_likelihood(q):
    # Rebuild the dataset using the current soft-labels
    X_n, Y_n = build_dataset(q)
    Y_soft = trh(q)
    # Implicit maximization with respect to the model parameters
    lr.fit(X_n, Y_n.squeeze())
    p_sup = lr.predict_proba(X[Y.squeeze() != -1])
    p_semi = lr.predict_proba(X[Y.squeeze() == -1])
    l_sup = weighted_log_loss(Y[Y.squeeze() != -1], p_sup)
    l_semi = weighted_log_loss(Y_soft, p_semi, q)
    # Contrastive log-likelihood, minimized with respect to q
    return l_semi - l_sup
This method will be called by the optimizer, passing a different q vector each time. The first step is building the new dataset and computing Y_soft, the binary labels corresponding to q. Then the logistic regression classifier is trained on this dataset (as Y_n is a (k, 1) array, it's necessary to squeeze it to avoid a warning; the same is done when using Y as a boolean indicator). At this point, it's possible to compute both p_sup and p_semi using the predict_proba() method and, finally, the semi-supervised and supervised log-losses. Their difference is the term, as a function of qi, that we want to minimize, while the maximization with respect to θ is performed implicitly when training the logistic regression.
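Before starting the optimization, it can be instructive to evaluate the objective at the random starting point q0 (the actual value changes at every run, because q0 is random):
print(log_likelihood(q0))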
The optimization is carried out using the BFGS algorithm implemented in SciPy:
from scipy.optimize import fmin_bfgs
q_end = fmin_bfgs(f=log_likelihood, x0=q0, maxiter=5000, disp=False)
This is a very fast algorithm, but the reader is encouraged to experiment with other methods or libraries (a constrained alternative is sketched below). The two parameters we need in this case are f, which is the function to minimize, and x0, which is the initial condition for the independent variables. maxiter is useful for avoiding an excessive number of iterations when no improvements are achieved. Once the optimization is complete, q_end contains the optimal soft-labels. We can, therefore, rebuild our dataset:
X_n, Y_n = build_dataset(q_end)
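As an example of such experimentation, and since the soft-labels are probabilities, scipy.optimize.minimize() with the L-BFGS-B method can enforce box constraints that keep every qi inside [0, 1]; this is only a sketch, and the results reported below refer to the unconstrained q_end obtained above:
from scipy.optimize import minimize

res = minimize(fun=log_likelihood,
               x0=q0,
               method='L-BFGS-B',
               bounds=[(0.0, 1.0)] * nb_unlabeled,
               options={'maxiter': 5000})
q_end_b = res.x   # constrained soft-labels (alternative to q_end)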
With this final configuration, we can retrain the logistic regression and check the cross-validation accuracy:
final_semi_cv_scores = cross_val_score(LogisticRegression(), X_n, Y_n.squeeze(), cv=10)
print(final_semi_cv_scores)
[ 1. 1. 0.89189189 0.77777778 0.97222222 0.88888889 0.61111111 0.88571429 0.94285714 0.48571429]
The semi-supervised solution based on the CPLE algorithm achieves an average accuracy of about 84%, outperforming, as expected, the purely supervised approach. The reader can try other examples using different classifiers, such as SVMs or decision trees, and verify when CPLE allows obtaining a higher accuracy than the corresponding supervised algorithm.
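As a starting point for such experiments, the global classifier instance can be replaced, for example, with an SVM with probability estimates enabled (probability=True is required because log_likelihood() calls predict_proba()); this is only a sketch under the assumption that the rest of the code is left unchanged, and training an SVM at every function evaluation can be noticeably slower:
from sklearn.svm import SVC

# Hypothetical variant: reuse the global name lr so that log_likelihood() works unchanged
lr = SVC(probability=True)
q0_svm = np.random.uniform(0, 1, size=nb_unlabeled)
q_end_svm = fmin_bfgs(f=log_likelihood, x0=q0_svm, maxiter=5000, disp=False)
X_s, Y_s = build_dataset(q_end_svm)
svm_cv_scores = cross_val_score(SVC(), X_s, Y_s.squeeze(), cv=10)
print(svm_cv_scores.mean())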