Fuzzy / similarity hashes
Cryptographic hashes are designed so that changing a single input bit scrambles the output completely. Fuzzy hashes aim for the opposite property: similar inputs should produce similar outputs. The canonical use is malware analysis: clustering variants of the same family, detecting packed-but-related samples, and finding shared code across binaries.
The four major families
ssdeep (context-triggered piecewise hashing)
- Author: Jesse Kornblum (2006).
- How it works: a rolling hash (Adler-like) over a small sliding window of the input chooses “trigger points”; each chunk between trigger points is hashed down to a single character, and those characters are concatenated into the signature.
- Format: blocksize:hash1:hash2 (two block sizes per signature).
- Comparison: a similarity score 0-100 derived from edit distance of the two hash strings.
- Use: first-pass malware family clustering, NIST NSRL extensions.
- ssdeep project
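The piecewise idea above can be sketched in a few lines of Python. This is a toy under stated assumptions, not ssdeep's implementation: the rolling hash is a plain sum over a 7-byte window, the chunk hash is 32-bit FNV-1a standing in for ssdeep's own chunk hash, the block size is fixed rather than derived from the input length, and the names `toy_ctph` and `piecewise` are hypothetical. Its output is not compatible with real ssdeep signatures.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7  # rolling-window length in bytes (toy parameter)


def fnv1a(chunk: bytes) -> int:
    """32-bit FNV-1a, used here as a stand-in chunk hash."""
    h = 0x811C9DC5
    for byte in chunk:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def piecewise(data: bytes, block_size: int) -> str:
    """Build one signature string: one base64 character per chunk."""
    chars = []
    start = 0
    window_sum = 0  # toy rolling hash: sum of the last WINDOW bytes
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]
        if window_sum % block_size == block_size - 1:  # trigger point
            chars.append(B64[fnv1a(data[start:i + 1]) & 0x3F])
            start = i + 1
    if start < len(data):  # trailing chunk after the last trigger
        chars.append(B64[fnv1a(data[start:]) & 0x3F])
    return "".join(chars)


def toy_ctph(data: bytes, block_size: int = 64) -> str:
    """Signature in blocksize:hash1:hash2 form (hash2 uses twice the block size)."""
    return f"{block_size}:{piecewise(data, block_size)}:{piecewise(data, 2 * block_size)}"
```

Because trigger points depend only on the last few bytes, a local edit changes only the characters for the chunks around it. For example, appending bytes to an input leaves every character except possibly the last one of each hash string unchanged, which is what makes an edit-distance comparison of two signatures meaningful.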
sdhash (similarity digest hash)
- Author: Vassil Roussev (2010).
- How it works: select “statistically improbable features” via local entropy, summarize them in a Bloom-filter-style signature.
- Comparison: set overlap between the two signatures' Bloom filters.
- Strengths over ssdeep: handles content reordering and small variants more gracefully.
- sdhash on GitHub
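The comparison step can be illustrated with a small Bloom-filter sketch. It deliberately skips sdhash's feature selection: the caller supplies the features as byte strings. The filter size, the number of hash positions, the use of SHA-1, and the Jaccard-style score are all illustrative assumptions rather than sdhash's actual parameters or scoring formula, and `bloom_of` and `bit_overlap` are hypothetical names.

```python
import hashlib

FILTER_BITS = 2048  # illustrative filter size
NUM_HASHES = 5      # illustrative number of bit positions per feature


def bloom_of(features) -> int:
    """Insert each feature into a Bloom filter held as a Python int bitmask."""
    bits = 0
    for feature in features:
        digest = hashlib.sha1(feature).digest()  # 20 bytes = five 32-bit words
        for k in range(NUM_HASHES):
            word = int.from_bytes(digest[4 * k:4 * k + 4], "big")
            bits |= 1 << (word % FILTER_BITS)
    return bits


def bit_overlap(a: int, b: int) -> float:
    """Shared set bits divided by all set bits (a simple Jaccard-style ratio)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


original = bloom_of([b"alpha", b"beta", b"gamma"])
reordered = bloom_of([b"gamma", b"alpha", b"beta"])
variant = bloom_of([b"alpha", b"beta", b"delta"])
print(bit_overlap(original, reordered))  # same features, different order
print(bit_overlap(original, variant))    # two of three features shared
```

A Bloom filter records which features are present but not where they occurred, so the reordered input yields the same filter as the original. That order-independence is the reason this style of signature tolerates reordered content better than a position-dependent string signature.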
TLSH (Trend Micro Locality Sensitive Hash)
- Author: Trend Micro (Oliver, Cheng, Chen, 2013).
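- How it works: the 35-byte digest (70 hex characters) is a 3-byte header plus a 32-byte body; the body summarizes the input as a histogram of byte-triplet counts taken over a sliding window, with each of the 128 bucket counts quantized against the histogram's quartiles into 2 bits.
- Comparison: Hamming-style distance (lower is more similar), where 0 means the digests are identical and roughly 100 or more means unrelated.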
- Strengths: fixed-size output, very fast comparison, well-defined distance metric.
- TLSH on GitHub
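The quartile quantization behind the 32-byte body can be sketched as follows. This is a toy modelled on the published scheme, with loud assumptions: `zlib.crc32` stands in for TLSH's own bucket-mapping hash, the header is omitted entirely, and the function names are hypothetical. It does not produce real TLSH digests.

```python
import zlib
from itertools import combinations

N_BUCKETS = 128  # 128 buckets x 2 bits each = 256 bits = 32 bytes
WINDOW = 5       # sliding-window length in bytes


def bucket_counts(data: bytes) -> list:
    """Histogram of byte triplets drawn from every 5-byte window."""
    counts = [0] * N_BUCKETS
    for i in range(len(data) - WINDOW + 1):
        w = data[i:i + WINDOW]
        # Six triplets per window: the newest byte plus each pair of the other four.
        for salt, (j, k) in enumerate(combinations(range(WINDOW - 1), 2)):
            triplet = bytes((salt, w[WINDOW - 1], w[j], w[k]))
            counts[zlib.crc32(triplet) % N_BUCKETS] += 1  # stand-in bucket hash
    return counts


def quantize_body(counts) -> bytes:
    """Map each bucket count to a 2-bit code by quartile, four codes per byte."""
    ranked = sorted(counts)
    n = len(ranked)
    q1, q2, q3 = ranked[n // 4 - 1], ranked[n // 2 - 1], ranked[3 * n // 4 - 1]
    codes = [0 if c <= q1 else 1 if c <= q2 else 2 if c <= q3 else 3 for c in counts]
    body = bytearray()
    for i in range(0, n, 4):
        c0, c1, c2, c3 = codes[i:i + 4]
        body.append(c0 | c1 << 2 | c2 << 4 | c3 << 6)
    return bytes(body)


def body_distance(a: bytes, b: bytes) -> int:
    """Sum per-bucket code differences; a jump across all quartiles costs extra."""
    total = 0
    for x, y in zip(a, b):
        for shift in (0, 2, 4, 6):
            d = abs(((x >> shift) & 3) - ((y >> shift) & 3))
            total += 6 if d == 3 else d
    return total
```

Quantizing against the input's own quartiles keeps the body a fixed 32 bytes regardless of file size, and comparing two bodies is a single pass over 128 two-bit codes, which is why pairwise comparison stays cheap at scale.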
imphash (PE import hash)
- Author: Mandiant / FireEye (2014).
- How it works: MD5 of the lowercased list of imported dll.function names from a Windows PE file’s Import Table, kept in import-table order rather than sorted.
- Not a fuzzy hash in the technical sense: it’s an exact hash of a summary of the file. But it’s “fuzzy enough” that two malware samples with the same import table hit the same imphash even when the rest of the file differs.
- Use: Pivoting between samples in VirusTotal / threat-intel platforms.
- Mandiant’s original blog post
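A minimal sketch of the computation, assuming the imports have already been parsed into (DLL name, function name) pairs. It follows the convention used by pefile's `get_imphash()`, which keeps entries in import-table order rather than sorting them, but it ignores details such as imports by ordinal. The function name `toy_imphash` and the sample import list are hypothetical.

```python
import hashlib


def toy_imphash(imports) -> str:
    """imports: iterable of (dll_name, function_name) in import-table order."""
    entries = []
    for dll, func in imports:
        lib = dll.lower()
        # Drop a common library extension so "KERNEL32.dll" becomes "kernel32".
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[:-len(ext)]
                break
        entries.append(f"{lib}.{func.lower()}")
    # Entries stay in the order given; reordering them changes the hash.
    return hashlib.md5(",".join(entries).encode()).hexdigest()


sample = [("KERNEL32.dll", "CreateFileA"), ("ADVAPI32.dll", "RegOpenKeyExA")]
print(toy_imphash(sample))
```

With the pefile package installed, `pefile.PE(path).get_imphash()` computes the value directly from a PE file on disk. Since the input is just the import list, any edit that adds, removes, or reorders an import produces an unrelated hash.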
When each one shines
| Use case | Best choice |
|---|---|
| Cluster a small dataset of unknown binaries | ssdeep |
| Compare large binaries with reordered content | sdhash |
| Production-scale fingerprinting with fast pairwise comparison | TLSH |
| Pivot in threat intelligence (Windows PE specifically) | imphash |
| Cluster documents / text | MinHash / SimHash |
| Cluster images | Perceptual hashes |
Adversarial caveats
- Malware authors who know that ssdeep or TLSH is in use can pad, reorder, or insert noise to break the similarity score. None of these hashes is adversarially robust.
- False positives across unrelated files do happen at scale. Always corroborate with at least one additional signal (file structure, behavior, additional hashes).
- imphash is brittle: small changes to the import table (different DLL versions, fewer wrapped functions) change the hash entirely.
References
- ssdeep usage docs
- sdhash on GitHub
- TLSH overview
- imphash original blog
- Perceptual hashes · MinHash