Fuzzy / similarity hashes

Cryptographic hashes are designed so that a one-bit change to the input flips roughly half of the output bits. Fuzzy hashes are designed for the opposite property: similar inputs should produce similar outputs. The canonical use is malware analysis: clustering variants of the same family, detecting packed-but-related samples, and finding shared code across binaries.
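To make the contrast concrete, here is a minimal sketch of the avalanche behavior of a cryptographic hash, using SHA-256 from Python's standard library. The sample input and the helper name `bit_difference` are illustrative choices, not part of any tool described here.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"The quick brown fox jumps over the lazy dog"
# Flip the lowest bit of the first byte: a one-bit change to the input.
modified = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

changed_bits = bit_difference(digest_a, digest_b)
print(f"{changed_bits} of 256 output bits differ")
```

The two digests share no useful structure, which is exactly why exact-match hashes cannot answer "how similar are these two files?".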

The four major families

ssdeep (context-triggered piecewise hashing)

Author: Jesse Kornblum (2006).
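How it works: a rolling hash (Adler-32-like) over a small sliding window chooses “trigger points”; each chunk between trigger points is hashed, and one character per chunk is concatenated into the signature.
Format: blocksize:hash1:hash2 (hash2 is computed at twice the block size, so each signature covers two block sizes).
Comparison: a similarity score from 0 to 100 derived from the edit distance between the two hash strings. Two signatures are only comparable when their block sizes are equal or differ by a factor of two.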
Use: first-pass malware family clustering, NIST NSRL extensions.
ssdeep project
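The piecewise idea can be sketched in a few lines. This is a toy under stated assumptions, not ssdeep itself: the rolling hash is a plain sum over a 7-byte window, the chunk hash is standard FNV-1a, the block size is fixed at 64 instead of being chosen from the input length, and only one of the two signature halves is produced. The names `toy_ctph`, `fnv1a32` and `common_suffix_len` are hypothetical.

```python
import random

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7  # rolling-window width in bytes (assumed for this sketch)


def fnv1a32(chunk: bytes) -> int:
    """Standard 32-bit FNV-1a, used here as a stand-in chunk hash."""
    h = 0x811C9DC5
    for byte in chunk:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def toy_ctph(data: bytes, block_size: int = 64) -> str:
    """Toy piecewise hash: one base64 character per content-defined chunk."""
    chars = []
    chunk_start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Keep the sum of the last WINDOW bytes, updated in O(1) per step.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # A trigger depends only on the last WINDOW bytes of context.
        if rolling % block_size == block_size - 1:
            chars.append(B64[fnv1a32(data[chunk_start:i + 1]) % 64])
            chunk_start = i + 1
    if chunk_start < len(data):
        chars.append(B64[fnv1a32(data[chunk_start:]) % 64])  # trailing chunk
    return f"{block_size}:{''.join(chars)}"


def common_suffix_len(a: str, b: str) -> int:
    """Number of trailing characters two strings have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = original[:100] + b"!" + original[100:]  # insert one byte near the start

sig_a = toy_ctph(original)
sig_b = toy_ctph(edited)
print(sig_a)
print(sig_b)
print("shared trailing characters:", common_suffix_len(sig_a, sig_b))
```

Because trigger points depend only on local context, the chunk boundaries resynchronize shortly after the inserted byte, so most of the signature is unchanged. That locality is what makes an edit-distance comparison of two signatures meaningful.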

sdhash (similarity digest hash)

Author: Vassil Roussev (2010).
How it works: select “statistically improbable features” via local entropy, summarize them in a Bloom-filter-style signature.
Comparison: set overlap of the Bloom filters.
Strengths over ssdeep: handles content reordering and small variants more gracefully.
sdhash on GitHub
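A rough sketch of the feature-selection-plus-Bloom-filter idea follows. It is a simplification under loud assumptions and not sdhash's algorithm: features are fixed, non-overlapping 16-byte blocks, selection is a plain Shannon-entropy threshold, a single 2048-bit filter is used, and the score is a 0 to 1 bit-overlap ratio. All constants and function names are hypothetical.

```python
import hashlib
import math
import random
from collections import Counter

FILTER_BITS = 2048  # assumed filter size
NUM_HASHES = 4      # assumed bit positions set per feature
FEATURE_LEN = 16    # assumed feature width in bytes
MIN_ENTROPY = 3.0   # assumed threshold; the maximum for 16 bytes is 4.0


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of the byte distribution in a block, in bits."""
    total = len(block)
    return -sum(c / total * math.log2(c / total) for c in Counter(block).values())


def select_features(data: bytes) -> list:
    """Keep only the blocks whose local entropy clears the threshold."""
    blocks = [data[i:i + FEATURE_LEN]
              for i in range(0, len(data) - FEATURE_LEN + 1, FEATURE_LEN)]
    return [b for b in blocks if shannon_entropy(b) >= MIN_ENTROPY]


def bloom_of_features(features) -> int:
    """Insert each feature into a Bloom filter stored as an int bitmask."""
    bits = 0
    for feature in features:
        digest = hashlib.sha1(feature).digest()
        for k in range(NUM_HASHES):
            # Carve NUM_HASHES positions out of a single SHA-1 digest.
            pos = int.from_bytes(digest[4 * k:4 * k + 4], "big") % FILTER_BITS
            bits |= 1 << pos
    return bits


def overlap_score(a: int, b: int) -> float:
    """Shared set bits as a fraction of the smaller filter's set bits."""
    smaller = min(bin(a).count("1"), bin(b).count("1"))
    return bin(a & b).count("1") / smaller if smaller else 0.0


rng = random.Random(1)
original = bytes(rng.randrange(256) for _ in range(1024))
reordered = original[512:] + original[:512]  # same blocks, different order
unrelated = bytes(rng.randrange(256) for _ in range(1024))

f_orig = bloom_of_features(select_features(original))
f_reord = bloom_of_features(select_features(reordered))
f_other = bloom_of_features(select_features(unrelated))
print("reordered:", overlap_score(f_orig, f_reord))
print("unrelated:", round(overlap_score(f_orig, f_other), 2))
```

A Bloom filter records which features are present, not where they occur, so swapping the two halves of the input leaves the filter unchanged. This is the intuition behind the reordering tolerance listed above.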

TLSH (Trend Micro Locality Sensitive Hash)

Author: Trend Micro (Oliver, Cheng, Chen, 2013).
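How it works: 3-byte header (6 hex characters) + 32-byte body, for 70 hex characters in total (newer versions prepend a "T1" version tag). A sliding window feeds byte triplets into a 128-bucket histogram, and each bucket is quantized to 2 bits against the histogram's quartiles to form the body.
Comparison: a distance score (header difference plus an approximate Hamming distance over the body), where 0 means the digests are identical and scores of roughly 100 or more suggest unrelated files.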
Strengths: fixed-size output, very fast comparison, well-defined distance metric.
TLSH on GitHub
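The body construction can be sketched as follows. This is a toy under stated assumptions rather than TLSH itself: `zlib.crc32` stands in for the bucket-mapping hash, the header (checksum, length, quartile ratios) is omitted entirely, and the distance covers the body only, with a larger penalty when two buckets sit in opposite quartiles. Function names are hypothetical.

```python
import random
import zlib
from itertools import combinations

NUM_BUCKETS = 128
WINDOW = 5  # sliding-window width in bytes


def bucket_counts(data: bytes) -> list:
    """Histogram of triplets drawn from each 5-byte window."""
    counts = [0] * NUM_BUCKETS
    for i in range(WINDOW - 1, len(data)):
        newest = data[i]
        older = data[i - WINDOW + 1:i]  # the four bytes before the newest one
        # Six triplets per window: the newest byte plus each pair of older bytes.
        for salt, (x, y) in enumerate(combinations(older, 2)):
            triplet = bytes([salt, newest, x, y])
            counts[zlib.crc32(triplet) % NUM_BUCKETS] += 1
    return counts


def quantize(counts: list) -> list:
    """Map each bucket count to a 2-bit code (0..3) by quartile."""
    ordered = sorted(counts)
    q1 = ordered[len(ordered) // 4 - 1]
    q2 = ordered[len(ordered) // 2 - 1]
    q3 = ordered[3 * len(ordered) // 4 - 1]
    return [0 if c <= q1 else 1 if c <= q2 else 2 if c <= q3 else 3
            for c in counts]


def pack(codes: list) -> bytes:
    """Pack four 2-bit codes per byte: 128 buckets become 32 bytes."""
    return bytes(codes[i] << 6 | codes[i + 1] << 4 | codes[i + 2] << 2 | codes[i + 3]
                 for i in range(0, len(codes), 4))


def body_distance(codes_a: list, codes_b: list) -> int:
    """Hamming-style distance; opposite quartiles (0 vs 3) cost extra."""
    total = 0
    for a, b in zip(codes_a, codes_b):
        diff = abs(a - b)
        total += 6 if diff == 3 else diff
    return total


rng = random.Random(2)
original = bytes(rng.randrange(256) for _ in range(2048))
variant = bytearray(original)
for pos in (10, 700, 1500):  # change three bytes
    variant[pos] ^= 0xFF
unrelated = bytes(rng.randrange(256) for _ in range(2048))

c_orig = quantize(bucket_counts(original))
c_var = quantize(bucket_counts(bytes(variant)))
c_other = quantize(bucket_counts(unrelated))
print("body size:", len(pack(c_orig)), "bytes")
print("variant distance:", body_distance(c_orig, c_var))
print("unrelated distance:", body_distance(c_orig, c_other))
```

Because every digest has the same length, comparing two of them is a single pass over 128 small integers. That fixed cost is what makes pairwise comparison cheap at scale.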

imphash (PE import hash)

Author: Mandiant / FireEye (2014).
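How it works: MD5 of the lowercased, comma-joined list of imported dll.function names from a Windows PE file’s Import Table. The list is kept in import-table order, not sorted, so the hash also reflects the order in which the linker emitted the imports.
Not a fuzzy hash in the technical sense: it’s an exact hash of a summary of the file. But it’s “fuzzy enough” that two malware samples with the same import list get the same imphash even when the rest of the file differs.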
Use: Pivoting between samples in VirusTotal / threat-intel platforms.
Mandiant’s original blog post
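The computation itself is small enough to sketch without parsing a PE file. The function below takes an already-extracted import list and follows the behavior of the widely used pefile implementation as an assumption: the DLL extension is stripped, names are lowercased, and import-table order is preserved. Resolving imports-by-ordinal to names is omitted, and `imphash_from_imports` is a hypothetical name.

```python
import hashlib


def imphash_from_imports(imports) -> str:
    """imports: ordered (dll_name, function_name) pairs from the import table."""
    parts = []
    for dll, func in imports:
        name = dll.lower()
        # Drop a trailing library extension so "KERNEL32.dll" becomes "kernel32".
        for ext in (".dll", ".ocx", ".sys"):
            if name.endswith(ext):
                name = name[:-len(ext)]
                break
        parts.append(f"{name}.{func.lower()}")
    # Order is preserved on purpose: no sorting before hashing.
    return hashlib.md5(",".join(parts).encode()).hexdigest()


sample = [
    ("KERNEL32.dll", "CreateFileW"),
    ("KERNEL32.dll", "WriteFile"),
    ("ADVAPI32.dll", "RegOpenKeyExW"),
]
same_imports_other_case = [(dll.upper(), func) for dll, func in sample]
reordered = list(reversed(sample))

print(imphash_from_imports(sample))
print(imphash_from_imports(sample) == imphash_from_imports(same_imports_other_case))
print(imphash_from_imports(sample) == imphash_from_imports(reordered))
```

For real files, the pefile library exposes this directly as `pefile.PE(path).get_imphash()`.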

When each one shines

Use case	Best choice
Cluster a small dataset of unknown binaries	ssdeep
Compare large binaries with reordered content	sdhash
Production-scale fingerprinting with fast pairwise comparison	TLSH
Pivot in threat intelligence (Windows PE specifically)	imphash
Cluster documents / text	MinHash / SimHash
Cluster images	Perceptual hashes

Adversarial caveats

Malware authors who know that ssdeep or TLSH is in use can pad, reorder, or insert noise to break the similarity score. None of these hashes is adversarially robust.
False positives across unrelated files do happen at scale. Always corroborate with at least one additional signal (file structure, behavior, additional hashes).
imphash is brittle: small changes to the import table (different DLL versions, fewer wrapped functions) change the hash entirely.