I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand the difference between the two. It seems Levenshtein gives the number of edits between two strings, while Jaro-Winkler provides a normalized score between 0.0 and 1.0.
My questions:
What are the fundamental differences between the two algorithms?
What is the performance difference between the two algorithms?
Different definitions of edit distance use different sets of string operations. The Levenshtein operations are the removal, insertion, or substitution of a single character in the string. Because it is the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other; for example, turning "kitten" into "sitting" takes three edits. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
The Jaro distance is an edit-distance-style measure between two strings; its complement, the Jaro similarity, measures how similar two strings are: the higher the value, the more similar the strings. The similarity is normalized so that 0 means no similarity and 1 is an exact match.
Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string into the other. Damerau-Levenshtein is a modified version that also counts transpositions of adjacent characters as single edits. Although the output is an integer number of edits, it can be normalized to a similarity value by the formula
1 - (edit distance / length of the longer of the two strings)
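To make the mechanics concrete, here is a minimal Python sketch of the Levenshtein distance (the standard two-row dynamic-programming formulation) together with the normalization above. The function names are my own, not from any particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a 0.0-1.0 similarity score:
    1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(levenshtein("kitten", "sitting"))                       # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571
```

Keeping only two rows of the table reduces memory from O(mn) to O(min(m, n)); the running time stays O(mn), which is what makes all-pairs comparison over millions of strings so costly.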
The Jaro algorithm measures the characters two strings have in common, counting only characters that are no farther apart than half the length of the longer string, with a penalty for transpositions. Winkler modified this algorithm to reward matching prefixes, on the idea that differences near the start of a string are more significant than differences near the end. Jaro and Jaro-Winkler are suited to comparing shorter strings such as words and names.
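Here is a similarly minimal sketch of Jaro and the Winkler prefix adjustment, assuming the commonly used scaling factor p = 0.1 and a matching prefix capped at four characters; again, the function names are only illustrative:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 (no similarity) to 1.0 (exact match)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and no farther apart than this window.
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    transpositions = 0
    k = 0
    for i, c in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if c != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The "MARTHA"/"MARHTA" pair is the classic worked example: all six characters match, the T/H swap counts as one transposition, and the shared "MAR" prefix lifts the Winkler score above the plain Jaro score.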
Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
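As a rough illustration of that cheaper approach, here is a sketch that groups names by a simplified American Soundex code (one common phonetic encoding); the sample names are only illustrative:

```python
from collections import defaultdict


def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits.
    Adjacent letters with the same code collapse into one digit;
    vowels break a run of equal codes, while h and w do not."""
    codes = {}
    for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1):
        for letter in letters:
            codes[letter] = str(digit)
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        if ch not in "hw":  # h and w are transparent: they keep the previous code
            prev = code
    return result.ljust(4, "0")


# Group records by phonetic key, then run the expensive pairwise
# comparison only within each (much smaller) group.
names = ["Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft", "Tymczak"]
groups = defaultdict(list)
for n in names:
    groups[soundex(n)].append(n)
print(dict(groups))
# {'R163': ['Robert', 'Rupert'], 'R150': ['Rubin'],
#  'A261': ['Ashcraft', 'Ashcroft'], 'T522': ['Tymczak']}
```

Grouping first collapses the O(n²) all-pairs comparison into many small within-group comparisons; a fuzzy metric like Jaro-Winkler can then be applied only inside each group.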
There is a wealth of detailed information on these and other fuzzy string matching algorithms on the internet. This paper will give you a start:
A Comparison of Personal Name Matching: Techniques and Practical Issues
According to that paper, the speed of the four Jaro and Levenshtein algorithms I've mentioned, from fastest to slowest, is: Jaro, Jaro-Winkler, Levenshtein, Damerau-Levenshtein, with the slowest taking 2 to 3 times as long as the fastest. Of course these times depend on the lengths of the strings and on the implementations, and there are ways to optimize these algorithms that may not have been used.