Normally, the goal of hashing is to destroy similarity: a small change in the input should cause a large change in the output. However, is there any hashing algorithm that will, (very) roughly speaking, return similar (but still different) hashes for similar inputs?
(An example use would be to check whether two files are "similar" by comparing their hashes. Of course, some rate of error is acceptable.)
One similarity hashing algorithm uses four sub-hash functions, each producing its own hash value; the four sub-hashes are concatenated to produce the final hash. This modularity lets sub-hash functions be added or removed, e.g., if an exploit against one of them is discovered.
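A minimal sketch of that modular, concatenated structure (the salted SHA-256 sub-hashes below are hypothetical placeholders and are not themselves similarity-preserving; in a real scheme each sub-hash would capture a different feature of the input):

```python
import hashlib

def sub_hash(data: bytes, salt: bytes) -> str:
    # Placeholder sub-hash: a salted SHA-256 truncated to 8 hex characters.
    return hashlib.sha256(salt + data).hexdigest()[:8]

def similarity_hash(data: bytes) -> str:
    # Concatenate four independent sub-hashes into one final hash value.
    # Adding or removing a sub-hash only changes which segments appear.
    salts = [b"sub1", b"sub2", b"sub3", b"sub4"]
    return ":".join(sub_hash(data, s) for s in salts)

print(similarity_hash(b"example file contents"))
```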
At the same time, two different keys can generate an identical hash; this is called a collision. Ideally a hash function would never produce the same value for two different inputs, but collisions cannot be avoided entirely because there are more possible inputs than outputs. As such, a hash function with an extremely low risk of collision is considered acceptable.
Fuzzy hashing is a type of compression function for estimating the similarity between digital files; it is often used to automate the grouping of similar malware samples. Fuzzy hash functions tolerate small changes in the input, so comparing the hashes of two files gives an estimate of how different the files are.
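A quick illustration of the fuzzy-hashing idea, assuming the python-ssdeep binding (pip install ssdeep) is available and exposes hash() and compare() as in its documentation:

```python
import ssdeep

# Two long, nearly identical inputs.
a = ssdeep.hash("the quick brown fox jumps over the lazy dog " * 100)
b = ssdeep.hash("the quick brown fox jumps over the lazy cat " * 100)

# compare() returns a match score from 0 to 100; near-identical inputs
# should score high, unrelated inputs near 0.
print(a)
print(b)
print(ssdeep.compare(a, b))
```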
Look at Locality Sensitive Hashing (LSH). It is a probabilistic way of quickly finding, for example, a set of points near a given one.
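A small pure-Python sketch of one common LSH scheme, MinHash with banding (the shingle size, signature length, and band count here are illustrative choices, not tuned values): documents whose signatures collide in any band become candidate near-neighbours without comparing every pair.

```python
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 3) -> set[str]:
    # Character n-grams serve as the feature set for each document.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(features: set[str], num_hashes: int = 64) -> list[int]:
    # MinHash signature: for each seeded hash function, keep the minimum
    # value over all features. Two signatures agree in roughly
    # (Jaccard similarity) of their positions.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), key=seed.to_bytes(2, "big"),
                                digest_size=8).digest(), "big")
            for f in features))
    return sig

def lsh_candidates(sigs: dict[str, list[int]], bands: int = 16):
    # Split each signature into bands; documents that match exactly in
    # any one band are reported as candidate near-neighbours.
    rows = len(next(iter(sigs.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = {}
        for name, sig in sigs.items():
            key = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, []).append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "completely unrelated piece of text about hashing",
}
sigs = {name: minhash(shingles(text)) for name, text in docs.items()}
print(lsh_candidates(sigs))  # expect ('a', 'b') among the candidate pairs
```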
Given a distance function that tells you how similar or different your objects are, you can also employ distance permutations: http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70815 or sketches: http://portal.acm.org/citation.cfm?id=1638180
For an implementation of the latter approach: http://obsearch.net
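A rough sketch of the distance-permutation idea mentioned above (not the algorithm from the cited paper): each object is represented by the order of a few pivot objects sorted by distance to it, and similar objects tend to produce similar permutations. The pivots, Euclidean distance, and Spearman footrule comparison here are illustrative choices.

```python
def distance_permutation(point, pivots, dist):
    # Indices of the pivots, ordered from nearest to farthest from `point`.
    return tuple(sorted(range(len(pivots)), key=lambda i: dist(point, pivots[i])))

def footrule(p, q):
    # Spearman footrule: total displacement between two permutations;
    # smaller values suggest the underlying objects are closer.
    pos_q = {v: i for i, v in enumerate(q)}
    return sum(abs(i - pos_q[v]) for i, v in enumerate(p))

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

pivots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p1 = distance_permutation((0.2, 0.2), pivots, euclid)
p2 = distance_permutation((0.25, 0.15), pivots, euclid)
p3 = distance_permutation((0.9, 0.9), pivots, euclid)
print(footrule(p1, p2), footrule(p1, p3))  # nearby points should score lower
```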