Hash Lab

Fuzzy / similarity hashes

Cryptographic hashes are designed so that changing a single bit of input completely scrambles the output (the avalanche effect). Fuzzy hashes are designed for the opposite property: similar inputs should produce similar outputs. The canonical use is malware analysis: clustering variants of the same family, detecting packed-but-related samples, and finding shared code across binaries.
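The contrast can be made concrete with a small sketch using only the Python standard library. It flips one input bit and counts how many SHA-256 output bits change, then compares Hamming distances under a toy 64-bit SimHash. The `simhash64` function, the whitespace tokenization, and the sample strings are illustrative assumptions, not the algorithm of any tool named on this page.

```python
import hashlib


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def sha256_int(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace tokens (illustrative only)."""
    counts = [0] * 64
    for token in text.split():
        digest = hashlib.blake2b(token.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every output bit.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


# Cryptographic hash: flip a single input bit.
original = b"fuzzy hashing example input"
flipped = bytes([original[0] ^ 0x01]) + original[1:]
crypto_distance = hamming(sha256_int(original), sha256_int(flipped))

# Similarity hash: change one word, then compare against unrelated text.
doc_a = ("the loader unpacks its payload into memory then resolves imports "
         "by hash and contacts a remote server over https to fetch a second "
         "stage before deleting the original dropper file")
doc_b = doc_a.replace("https", "dns")
doc_c = ("quarterly revenue grew faster than expected because new customers "
         "in europe signed long contracts and churn among existing accounts "
         "stayed low despite a price increase announced last spring")

similar_distance = hamming(simhash64(doc_a), simhash64(doc_b))
unrelated_distance = hamming(simhash64(doc_a), simhash64(doc_c))

print("SHA-256 bits changed (of 256):", crypto_distance)
print("SimHash distance, one word changed (of 64):", similar_distance)
print("SimHash distance, unrelated text (of 64):", unrelated_distance)
```

For SHA-256, roughly half of the 256 output bits should differ. For the toy SimHash, the one-word edit should land much closer to the original than the unrelated text does.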

The four major families

ssdeep (context-triggered piecewise hashing)

sdhash (similarity digest hash)

TLSH (Trend Micro Locality Sensitive Hash)

imphash (PE import hash)
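To give a feel for the first family in the list, here is a toy sketch of the idea behind context-triggered piecewise hashing: a rolling value over the last few bytes decides where chunks end, and each chunk contributes one character to the signature. The window size, trigger rule, chunk hash, and the name `ctph_sketch` are all assumptions chosen for brevity. This is not ssdeep's actual rolling hash or output format.

```python
import hashlib

WINDOW = 7   # bytes in the rolling window (assumed value)
BLOCK = 64   # a boundary fires about once per BLOCK bytes (assumed value)
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz0123456789+/")


def ctph_sketch(data: bytes) -> str:
    """Toy piecewise hash: one signature character per content-defined chunk."""
    signature = []
    roll = 0     # sum of the last WINDOW bytes, a deliberately simple rolling value
    start = 0    # start offset of the current chunk
    for i, byte in enumerate(data):
        roll += byte
        if i >= WINDOW:
            roll -= data[i - WINDOW]
        # The trigger depends only on the last WINDOW bytes, so boundaries
        # move with the content instead of sitting at fixed offsets.
        if roll % BLOCK == BLOCK - 1:
            chunk = data[start:i + 1]
            signature.append(ALPHABET[hashlib.sha256(chunk).digest()[0] % 64])
            start = i + 1
    if start < len(data):
        tail = data[start:]
        signature.append(ALPHABET[hashlib.sha256(tail).digest()[0] % 64])
    return "".join(signature)


# 4096 deterministic pseudo-random bytes, then the same data with an insertion.
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))
edited = base[:2000] + b"INSERTED BYTES" + base[2000:]

sig_base = ctph_sketch(base)
sig_edited = ctph_sketch(edited)
print(sig_base)
print(sig_edited)
```

Because chunk boundaries follow the content, the insertion should disturb only the signature characters near the edit, while the characters before and after it stay aligned. Real tools then score two signatures with a string-similarity measure rather than testing for equality.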

When each one shines

Use case | Best choice
Cluster a small dataset of unknown binaries | ssdeep
Compare large binaries with reordered content | sdhash
Production-scale fingerprinting with fast pairwise comparison | TLSH
Pivot in threat intelligence (Windows PE specifically) | imphash
Cluster documents / text | MinHash / SimHash
Cluster images | Perceptual hashes
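For the documents / text row, a minimal MinHash sketch shows how a fixed-length signature can approximate Jaccard similarity between shingle sets. It uses only the standard library. The two-word shingles, the 128 salted hash functions, and the helper names (`shingles`, `minhash`, `estimate`, `jaccard`) are illustrative assumptions, not a production design.

```python
import hashlib


def shingles(text: str, k: int = 2) -> set:
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(items: set, num_hashes: int = 128) -> list:
    """One minimum per salted hash function; items must be non-empty."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return signature


def estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


text_a = ("the loader unpacks its payload into memory then resolves imports "
          "by hash and contacts a remote server over https to fetch a second "
          "stage before deleting the original dropper file")
text_b = text_a.replace("https", "dns")
text_c = ("quarterly revenue grew faster than expected because new customers "
          "in europe signed long contracts and churn among existing accounts "
          "stayed low despite a price increase announced last spring")

set_a, set_b, set_c = shingles(text_a), shingles(text_b), shingles(text_c)
sig_a, sig_b, sig_c = minhash(set_a), minhash(set_b), minhash(set_c)

exact_ab = jaccard(set_a, set_b)
est_ab = estimate(sig_a, sig_b)
est_ac = estimate(sig_a, sig_c)
print("exact Jaccard a/b:", round(exact_ab, 3))
print("MinHash estimate a/b:", est_ab)
print("MinHash estimate a/c:", est_ac)
```

The estimate tightens as the number of hash functions grows, at the cost of a longer signature. Clustering at scale would typically add an index on top of the signatures so that not every pair has to be compared.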

Adversarial caveats

References