I want to create a database with files. And, to easily search these files, I want to use some kind of hashing technique. However, I don't only want to find files that are EXACTLY the same, but also check if parts of the files are the same(i.e., the files are similar). in other words, similar files should have similar hashes.
This means that this kind of hash is not really a cryptographic hash because there should not be an 'avalanche effect' (avalanche effect means that each bit of data affects ALL other bits of other data.)
Another thing is that the hash does not need to be one-way, since it isn't used for securitypurposes but for the comparing of files.
So in essence, I'm searching for an algorithm that can create an unique hash for each unique input that:
Has (almost) no collision
Creates a similar output for similar inputs
Is shorter than the original file (otherwise it would be faster to simply compare the original files instead).
I was thinking of something like adding the first two characters together, then adding the 3rd and 4rth together, etc. However, this has a HUGE amount of collision since "1+4" is the same as "2+2", etc
I really have no idea how to start. Could somebody enlighten me please? :)
Consistency / Deterministic One way hashes will always produce the same output for a given input. This is important in ensuring that regardless of when or on what system a hash has been generated, it is guaranteed that the same input generated at a later time or on a different computer will result in the same output.
Therefore, there's always a chance that two different inputs will generate the same hash value. Take the well-known hash function CRC32, for example. If you feed this function the two strings “plumless” and “buckeroo”, it generates the same value. This is known as a hash collision.
Hash Collision If two different inputs are having the same hash value, it is called a collision. Since hash functions have infinite input length and a predefined output length, there is a possibility of two different inputs that produce same hash value.
When two items hash to the same slot, we must have a systematic method for placing the second item in the hash table. This process is called collision resolution.
This is commonly called the near duplicate detection problem and is not easy to solve; I would recommend the simhash algorithm (code is here).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With