How to create a hash that is similar for similar input?

I want to create a database of files and, to search these files easily, I want to use some kind of hashing technique. However, I don't only want to find files that are EXACTLY the same; I also want to check whether parts of the files are the same (i.e., the files are similar). In other words, similar files should have similar hashes.

This means that this kind of hash is not really a cryptographic hash, because there should be no 'avalanche effect' (the avalanche effect means that changing a single bit of the input changes roughly half of the output bits).

Another thing is that the hash does not need to be one-way, since it isn't used for security purposes but for comparing files.

So in essence, I'm searching for an algorithm that can create a unique hash for each unique input that:

Has (almost) no collisions
Creates a similar output for similar inputs
Is shorter than the original file (otherwise it would be faster to simply compare the original files instead).

I was thinking of something like adding the first two characters together, then adding the 3rd and 4th together, etc. However, this has a HUGE number of collisions, since "1+3" is the same as "2+2", etc.
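To make the collision problem concrete, here is a minimal sketch of that pairing idea. The function name `pair_sum_hash` is hypothetical, and it assumes that "adding characters" means adding their character codes, with a zero pad for odd-length input.

```python
def pair_sum_hash(text):
    """Add the character codes of each consecutive pair of characters.

    Hypothetical illustration of the pairing idea; not a recommended hash.
    """
    codes = [ord(c) for c in text]
    if len(codes) % 2:
        codes.append(0)  # assumption: pad odd-length input with a zero
    return [codes[i] + codes[i + 1] for i in range(0, len(codes), 2)]


# Different inputs collide whenever a pair has the same sum,
# and swapping the two characters of a pair is never detected.
print(pair_sum_hash("13") == pair_sum_hash("22"))
print(pair_sum_hash("ab") == pair_sum_hash("ba"))
```

Both comparisons are true: the output does stay similar for similar input, but many unrelated inputs map to the same value.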

I really have no idea how to start. Could somebody enlighten me please? :)

358

asked Nov 26 '11 22:11

Qqwy

1 Answer

This is commonly called the near-duplicate detection problem, and it is not easy to solve. I would recommend the simhash algorithm (code is here).
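For orientation, a minimal sketch of the simhash idea might look like the following. This is not the linked code. It assumes whitespace-separated words as features, equal weight for every word, BLAKE2b as the per-word hash, and a 64-bit fingerprint; the names `simhash` and `hamming_distance` are illustrative.

```python
import hashlib


def simhash(text, bits=64):
    """Return a `bits`-wide fingerprint; similar texts tend to differ in few bits.

    Assumptions: features are lowercase whitespace-separated words, all with
    weight 1, and `bits` is a multiple of 8 no larger than 512.
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the positive votes win.
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two files are then compared by the Hamming distance between their fingerprints: a small distance suggests near-duplicates, while unrelated inputs tend to differ in roughly half of the bits. The fingerprint has a fixed size regardless of file length, so it is much shorter than the original file.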

178

answered Sep 27 '22 23:09

Jeff Kubina

Tags: string, file, algorithm, comparison, hash
