Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a hash that is similar for similar input?

I want to create a database with files. And, to easily search these files, I want to use some kind of hashing technique. However, I don't only want to find files that are EXACTLY the same, but also check if parts of the files are the same(i.e., the files are similar). in other words, similar files should have similar hashes.

This means that this kind of hash is not really a cryptographic hash because there should not be an 'avalanche effect' (avalanche effect means that each bit of data affects ALL other bits of other data.)

Another thing is that the hash does not need to be one-way, since it isn't used for securitypurposes but for the comparing of files.

So in essence, I'm searching for an algorithm that can create an unique hash for each unique input that:

  • Has (almost) no collision

  • Creates a similar output for similar inputs

  • Is shorter than the original file (otherwise it would be faster to simply compare the original files instead).

I was thinking of something like adding the first two characters together, then adding the 3rd and 4rth together, etc. However, this has a HUGE amount of collision since "1+4" is the same as "2+2", etc

I really have no idea how to start. Could somebody enlighten me please? :)

like image 358
Qqwy Avatar asked Nov 26 '11 22:11

Qqwy


People also ask

Does hash produce same output for same input?

Consistency / Deterministic One way hashes will always produce the same output for a given input. This is important in ensuring that regardless of when or on what system a hash has been generated, it is guaranteed that the same input generated at a later time or on a different computer will result in the same output.

Can two inputs have the same hash?

Therefore, there's always a chance that two different inputs will generate the same hash value. Take the well-known hash function CRC32, for example. If you feed this function the two strings “plumless” and “buckeroo”, it generates the same value. This is known as a hash collision.

Can hash values be the same?

Hash Collision If two different inputs are having the same hash value, it is called a collision. Since hash functions have infinite input length and a predefined output length, there is a possibility of two different inputs that produce same hash value.

What is it called when two elements hash to the same position?

When two items hash to the same slot, we must have a systematic method for placing the second item in the hash table. This process is called collision resolution.


1 Answers

This is commonly called the near duplicate detection problem and is not easy to solve; I would recommend the simhash algorithm (code is here).

like image 178
Jeff Kubina Avatar answered Sep 27 '22 23:09

Jeff Kubina