
String similarity score/hash

Is there a method to calculate something like a general "similarity score" of a string? I don't want to compare two strings directly; instead, I want to get a number (a hash) for each string that can later tell me whether two strings are similar or not. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that Hello world! and Hello world are similar and their scores are close to each other.

This way, finding the strings most similar to a given string could be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
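A minimal Python sketch of that lookup, assuming every string already has a score. The values are copied from the example table above; `most_similar` is a hypothetical helper name, and how to compute such scores in the first place is the open question.

```python
# Scores copied from the example table in the question. They are
# illustrative only: computing them is what the question asks about.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores):
    """Rank all other strings by absolute score difference from `query`."""
    target = scores[query]
    others = (s for s in scores if s != query)
    return sorted(others, key=lambda s: abs(scores[s] - target))


ranking = most_similar("Hello world", scores)
print(ranking)
# ['Hello world!', 'Hello earth', 'Foo world!', 'Foo bar', 'Foo Bar!', 'FooBarbar']
```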

Josef Sábl asked Dec 01 '10 11:12





1 Answer

I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed so that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional notion of similarity into a single number. It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find a locality-sensitive hash that is optimized for whatever feature you're using to determine whether strings are "alike".
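As one illustration of a locality-sensitive hash, here is a rough Python sketch of SimHash over character trigrams. This is my choice of technique and feature, not something the question prescribes; the function names, the trigram size, and the use of MD5 as the per-feature hash are all assumptions. Note that the fingerprints are compared by Hamming distance (number of differing bits) rather than by subtracting them, which reflects the limitation described above.

```python
import hashlib


def shingles(text, n=3):
    """Character n-grams of the lowercased text (the chosen 'feature')."""
    text = text.lower()
    if len(text) < n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, bits=64):
    """SimHash fingerprint; `bits` must not exceed 128 (MD5 digest size)."""
    counts = [0] * bits
    for sh in shingles(text):
        # Hash each shingle to a large integer and let it vote on every bit.
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest(), "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams tend to produce fingerprints that differ in only a few bits, but this is a statistical tendency, not a guarantee.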

DougW answered Sep 21 '22 21:09
