How to do it...

In the following steps, we will see a demonstration of how scikit-learn's K-means clustering algorithm performs on a toy PE malware classification:

Start by importing and plotting the dataset:

import pandas as pd
import plotly.express as px

df = pd.read_csv("file_pe_headers.csv", sep=",")
fig = px.scatter_3d(
    df,
    x="SuspiciousImportFunctions",
    y="SectionsLength",
    z="SuspiciousNameSection",
    color="Malware",
)
fig.show()

The following screenshot shows the output:

Extract the features and target labels:

y = df["Malware"]
X = df.drop(["Name", "Malware"], axis=1).to_numpy()

Next, import scikit-learn's clustering module and fit a K-means model with two clusters to the data:

from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=len(set(y)))
estimator.fit(X)

Predict the cluster using our trained algorithm:

y_pred = estimator.predict(X)
df["pred"] = y_pred
df["pred"] = df["pred"].astype("category")

To see how the algorithm did, plot the algorithm's clusters:

fig = px.scatter_3d(
    df,
    x="SuspiciousImportFunctions",
    y="SectionsLength",
    z="SuspiciousNameSection",
    color="pred",
)
fig.show()

The following screenshot shows the output:

The results are not perfect, but we can see that the clustering algorithm captured much of the structure in the dataset.