String similarity score/hash

Is there a method to calculate something like a general "similarity score" for a string? That is, rather than comparing two strings directly, I would get a number (a hash) for each string, and those numbers could later tell me whether two strings are similar. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

String          Score
Hello world     1000
Hello world!    1010
Hello earth     1125
Foo bar         3250
FooBarbar       3750
Foo Bar!        3300
Foo world!      2350

You can see that "Hello world!" and "Hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string would be done by subtracting the given string's score from each of the other scores and then sorting by the absolute value of the differences.
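To make the intended lookup concrete, here is a minimal sketch of it in Python. The scores are simply copied from the example above (they are made up, not computed by any real algorithm), and the function name `most_similar` and its parameter `k` are hypothetical names used only for illustration.

```python
# Illustrative scores copied from the example in the question; not computed.
scores = {
    "Hello world": 1000,
    "Hello world!": 1010,
    "Hello earth": 1125,
    "Foo bar": 3250,
    "FooBarbar": 3750,
    "Foo Bar!": 3300,
    "Foo world!": 2350,
}


def most_similar(query, scores, k=3):
    """Return the k strings whose score is closest to the query's score."""
    target = scores[query]
    others = (s for s in scores if s != query)
    # Sort the other strings by the absolute difference of their scores.
    return sorted(others, key=lambda s: abs(scores[s] - target))[:k]


print(most_similar("Hello world", scores))
# → ['Hello world!', 'Hello earth', 'Foo world!']
```

The open question is how to compute scores that behave this way.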

asked Dec 01 '10 11:12

Josef Sábl

1 Answer

I believe what you're looking for is called a locality-sensitive hash (LSH). Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a lower-dimensional one (here, a single number per string). It's analogous to creating a flat map of the Earth: you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
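As a rough illustration, here is a minimal sketch of one locality-sensitive scheme, SimHash, in Python. This is my choice of example, not necessarily the best fit for your data. It assumes the "feature" is the set of lowercase character trigrams, and the names `shingles`, `simhash`, and `hamming` are hypothetical helpers. Note that closeness is measured by the Hamming distance between two hashes (the number of differing bits), not by numeric subtraction, so it does not give you a single sortable score.

```python
import hashlib


def shingles(text, n=3):
    """Set of overlapping lowercase character n-grams (the assumed feature)."""
    text = text.lower()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def simhash(text, bits=64):
    """SimHash sketch: each feature votes on every bit of the fingerprint."""
    counts = [0] * bits
    for gram in shingles(text):
        # MD5 is used only as a stable, well-mixed hash, not for security.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when more features voted 1 than 0.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


base = simhash("Hello world")
for other in ["Hello world!", "Hello earth", "Foo bar"]:
    print(other, hamming(base, simhash(other)))
```

Strings that share most of their trigrams should usually end up only a few bits apart, while unrelated strings tend to differ in roughly half of the bits.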

answered Sep 21 '22 21:09

DougW

Tags:

algorithm

hash

similarity
