How it works...

This scenario simulates tampering with a file and then utilizing similarity hashing to detect the existence of tampering, as well as measuring the size of the delta. We begin with a vanilla Python executable and then tamper with it by adding a null byte at the end (step 1). In real life, a hacker may take a legitimate program and insert malicious code into the sample. We double-checked that the tempering was successful and examined its nature using a hexdump in step 2. We then ran a similarity computation using similarity hashing on the original and tempered file, to observe that a minor alteration took place (step 3). Utilizing only standard hashing, we would have no idea how the two files are related, other than to conclude that they are not the same file. Knowing how to compare files allows us to cluster malware and benign files in machine learning algorithms, as well as group them into families.