What are some algorithms for comparing how similar two strings are?

I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
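
The normalization step described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the helper name normalizeTitle is hypothetical, and it assumes plain ASCII input (digits are kept, since only punctuation and spaces are said to be removed).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase the title and drop everything that is
// not a letter or digit (punctuation, spaces, '&', and so on).
// Assumes ASCII input; accented characters would need locale-aware handling.
std::string normalizeTitle(const std::string& title) {
    std::string out;
    out.reserve(title.size());
    for (unsigned char ch : title) {
        if (std::isalnum(ch)) {
            out.push_back(static_cast<char>(std::tolower(ch)));
        }
    }
    return out;
}
```

Applied to the two titles above, this should yield the firstNormalized and secondNormalized values shown.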

In this case, one normalized string is a sub-sequence of the other, but you can imagine more complex variations where that does not hold, yet the strings still have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and misspellings.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in. Is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?
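
One possible reading of the "consecutive characters in common" idea is the length of the longest common substring relative to the longer string. The sketch below is an assumption about what such a heuristic could look like; the names longestCommonSubstring and sharedRunRatio are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest run of consecutive characters shared by a and b
// (classic dynamic programming, keeping only two rows of the table).
std::size_t longestCommonSubstring(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // Extend the run ending at (i-1, j-1) on a match, else reset.
            curr[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// Hypothetical heuristic: shared run length divided by the longer length.
// Returns a value in [0, 1]; two empty strings count as identical.
double sharedRunRatio(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return static_cast<double>(longestCommonSubstring(a, b)) /
           static_cast<double>(longest);
}
```

For the two normalized titles, the shared run is the entire shorter string (33 of 39 characters). A single longest run is fragile when errors are scattered through the string, which is one reason edit-distance metrics may be a better fit.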

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?

947

asked Mar 08 '13 21:03

WilliamKF

2 Answers

What you're looking for are called string metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

Levenshtein Distance : The minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The strings do not have to be the same length.
Hamming Distance : The number of positions at which two equal-length strings differ.
Smith–Waterman : A family of algorithms for finding similar sub-sequences of variable length (local alignment).
Sørensen–Dice Coefficient : A similarity coefficient computed from the adjacent character pairs (bigrams) that the two strings share.
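
As an illustration of how one of these metrics could be turned into the yes/no decision the question asks for, here is a minimal Levenshtein sketch. The function names and the 0.8 threshold are assumptions for the example, not values from the answer; a real threshold would need tuning against known matching and non-matching titles.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance using two rows of the dynamic-programming table.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Scale the distance into [0, 1], where 1 means identical strings.
double levenshteinSimilarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                 static_cast<double>(longest);
}

// Hypothetical decision rule; the default threshold is an arbitrary
// starting point, not a recommended value.
bool probablySame(const std::string& a, const std::string& b,
                  double threshold = 0.8) {
    return levenshteinSimilarity(a, b) >= threshold;
}
```

For the two normalized titles in the question, the distance is 6 (the six leading characters "henryc" are deleted), giving a similarity of 1 - 6/39, roughly 0.85.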

Have a look at these as well as others on the wiki page on the topic.
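
Of the metrics listed, the Sørensen–Dice coefficient is less sensitive to where in the string a difference occurs, because it only counts shared character pairs. A rough sketch over bigrams follows; the function name is made up, and counting repeated bigrams as a multiset is one common convention, assumed here.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Sørensen–Dice coefficient over character bigrams:
//   2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)
// Repeated bigrams are matched at most as often as they occur in both.
double diceCoefficient(const std::string& a, const std::string& b) {
    // Strings shorter than two characters have no bigrams.
    if (a.size() < 2 || b.size() < 2) return a == b ? 1.0 : 0.0;

    std::unordered_map<std::string, int> counts;
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        ++counts[a.substr(i, 2)];
    }

    std::size_t shared = 0;
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        auto it = counts.find(b.substr(i, 2));
        if (it != counts.end() && it->second > 0) {
            --it->second; // consume one occurrence from a
            ++shared;
        }
    }
    return 2.0 * static_cast<double>(shared) /
           static_cast<double>((a.size() - 1) + (b.size() - 1));
}
```

The result lies in [0, 1], so it can be compared against a threshold in the same way as a normalized edit distance.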

150

answered Oct 02 '22 05:10

Daniel Frey

Damerau–Levenshtein distance is another algorithm for comparing two strings, and it is similar to the Levenshtein distance. The difference is that it also counts the transposition of two adjacent characters as a single edit, and hence may give a better result for error correction.

For example, the Levenshtein distance between night and nigth is 2, but the Damerau–Levenshtein distance between night and nigth is 1, because the difference is just a swap of one pair of adjacent characters.
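
The night / nigth example can be reproduced with the sketch below. Note that it implements the restricted variant usually called optimal string alignment, which is simpler than the unrestricted Damerau–Levenshtein distance and can differ from it on some inputs; the function name osaDistance is chosen for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Optimal string alignment distance: Levenshtein plus transposition of
// two adjacent characters, where no substring is edited more than once.
std::size_t osaDistance(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    std::vector<std::vector<std::size_t>> d(
        n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) d[0][j] = j;

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,         // deletion
                                d[i][j - 1] + 1,         // insertion
                                d[i - 1][j - 1] + cost}); // substitution
            // Adjacent transposition: "ht" in one string, "th" in the other.
            if (i > 1 && j > 1 &&
                a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[n][m];
}
```

With this function, osaDistance("night", "nigth") evaluates to 1, whereas plain Levenshtein gives 2 for the same pair.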

answered Oct 02 '22 06:10

Ankit Chaurasia

Tags: language-agnostic, algorithm, string-comparison, stdstring, heuristics
