Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hashing Similarity

Tags:

hash

Normally, the goal of hashing is to turn a continuous function into a discrete one: a small change in the input should cause a large change in the output. However, is there any hashing algorithm that will, (very) roughly speaking, return similar but (still different) hashes for similar inputs?

(An example of the use of this would be to check whether two files are "similar" by checking their hashes for similarity. Of course, some failure is always acceptable.)

like image 214
user541686 Avatar asked Jan 29 '11 00:01

user541686


People also ask

What is similarity hashing?

The similarity hashing algorithm uses four sub-hash functions, each producing its own hash value. The four sub-hashes are concatenated to produce the final hash value. This modularity enables sub-hash functions to be added or removed, e.g., if an exploit for a sub-hash function is discovered.

Can hash values be the same?

At the same time, two keys can also generate an identical hash. This phenomenon is called a collision. A good hash function never produces the same hash value from two different inputs. As such, a hash function that comes with an extremely low risk of collision is considered acceptable.

What is fuzzy hashing?

Fuzzy hashing is a type of compression function for calculating the similarity between digital files. It attempts to automate the process of grouping similar malware. Fuzzy hash functions hold a certain tolerance for changes, and can tell how different two files are by comparing the similarity of their outputs.


2 Answers

Look at Locality Sensitive Hashing (LSH). That is a probabilistic way of quickly finding a bunch of points near a given one, for example.

like image 105
Jeremiah Willcock Avatar answered Oct 15 '22 16:10

Jeremiah Willcock


Given a distance function that tells you how similar or different are your objects, you can also employ distance permutations: http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2007.70815 or sketches: http://portal.acm.org/citation.cfm?id=1638180

For an implementation of the latter approach: http://obsearch.net

like image 20
Arnoldo Muller Avatar answered Oct 15 '22 16:10

Arnoldo Muller