Is there a method to calculate something like a general "similarity score" of a string? That is, rather than comparing two strings directly, I would get a number (a hash) for each string, and those numbers could later tell me whether two strings are similar. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world     1000
Hello world!    1010
Hello earth     1125
Foo bar         3250
FooBarbar       3750
Foo Bar!        3300
Foo world!      2350
You can see that Hello world! and Hello world are similar and their scores are close to each other.
This way, finding the strings most similar to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the differences.
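The lookup described above can be sketched in a few lines of Python. This is only an illustration: the scores are the made-up numbers from the example, and the function name most_similar and its k parameter are hypothetical.

```python
# Example scores copied from the question; they are illustrative, not computed.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = [s for s in scores if s != query]
    # Sort by the absolute difference between each score and the target score.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(most_similar("Hello world", scores))
# → ['Hello world!', 'Hello earth', 'Foo world!']
```

With real data the hard part is producing scores that behave this way, not the lookup itself.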
The usual way to check the similarity between data points or groups is to calculate the distance between them. The same holds for textual data: we check the similarity between strings by calculating the distance from one text to another.
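The text does not say which distance is meant. As one common choice, here is a minimal sketch of the Levenshtein (edit) distance, which counts the single-character insertions, deletions and substitutions needed to turn one string into another. Note that it compares two strings directly, which is exactly what the question hopes to avoid.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Hello world", "Hello world!"))  # → 1
print(levenshtein("kitten", "sitting"))            # → 3
```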
Hashing works in one direction only: for a given piece of data, you'll always get the same hash, but you can't turn a hash back into its original data. If you need to go in both directions, you need encryption rather than hashing.
If you just want to have a good hash function, and cannot wait, djb2 is one of the best string hash functions I know. It has excellent distribution and speed on many different sets of keys and table sizes. You are not likely to do better with one of the "well known" functions such as PJW, K&R[1], etc.
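For reference, a minimal Python sketch of djb2 follows. It assumes the commonly cited form (start at 5381, multiply by 33, add each byte); the UTF-8 encoding and the 32-bit mask are choices made here to mimic a fixed-width C integer. It is a general-purpose hash, so it is not designed to give similar strings close values.

```python
def djb2(s):
    """djb2 string hash, truncated to 32 bits."""
    h = 5381
    for byte in s.encode("utf-8"):
        # h * 33 + byte, kept within 32 bits like an unsigned C integer.
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


print(djb2(""))   # → 5381
print(djb2("a"))  # → 177670
```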
I believe what you're looking for is called a Locality Sensitive Hash (LSH). Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
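One well-known scheme in this family is SimHash. The sketch below is a simplified version under stated assumptions: it uses lowercased character trigrams as features and BLAKE2b from hashlib as the per-feature hash, and the helper names shingles, simhash and hamming are made up for this example. Fingerprints are compared by Hamming distance (the number of differing bits) rather than by subtraction, so it is not quite the single sortable score the question asks for.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8-byte digest used below


def shingles(text, n=3):
    """Set of lowercased character n-grams of text."""
    text = text.lower()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def simhash(text):
    """64-bit SimHash fingerprint of text over character trigrams."""
    counts = [0] * BITS
    for sh in shingles(text):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams tend to produce fingerprints that differ in only a few bits, while unrelated strings tend to differ in roughly half of them; this is a statistical tendency, not a guarantee.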
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a lower-dimensional one (here, a single number). It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".