I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand the difference between the two. It seems Levenshtein gives the number of edits between two strings, while Jaro-Winkler provides a normalized score between 0.0 and 1.0.
My questions:
What are the fundamental differences between the two algorithms?
What is the performance difference between the two algorithms?
Different definitions of edit distance use different sets of string operations. The Levenshtein operations are the removal, insertion, or substitution of a single character in the string. Because it is the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other; for example, turning "kitten" into "sitting" takes three edits. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.
The Jaro distance is an edit-distance-style measure between two strings; its complement, the Jaro similarity, measures how similar two strings are: the higher the value, the more similar the strings. The similarity is normalized so that 0 means no similarity and 1 is an exact match.
Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string into the other. Damerau-Levenshtein is a modified version that also counts transpositions of adjacent characters as single edits. Although the output is an integer number of edits, it can be normalized to a similarity value by the formula
1 - (edit distance / length of the longer of the two strings)
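To make the mechanics concrete, here is a minimal Python sketch of the Levenshtein distance (the standard two-row dynamic-programming formulation) together with the normalization above. The function names are my own, not from any particular library:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a 0.0-1.0 similarity score:
    1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(levenshtein("kitten", "sitting"))                       # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571
```

Keeping only two rows of the table reduces memory from O(mn) to O(min(m, n)); the running time stays O(mn), which is what makes all-pairs comparison over millions of strings so costly.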
The Jaro algorithm measures the characters two strings have in common, counting only characters that are no farther apart than half the length of the longer string, with a penalty for transpositions. Winkler modified this algorithm to reward matching prefixes, on the idea that differences near the start of a string are more significant than differences near the end. Jaro and Jaro-Winkler are suited to comparing shorter strings such as words and names.
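Here is a similarly minimal sketch of Jaro and the Winkler prefix adjustment, assuming the commonly used scaling factor p = 0.1 and a matching prefix capped at four characters; again, the function names are only illustrative:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 0.0 (no similarity) to 1.0 (exact match)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and no farther apart than this window.
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    transpositions = 0
    k = 0
    for i, c in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if c != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - transpositions) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The "MARTHA"/"MARHTA" pair is the classic worked example: all six characters match, the T/H swap counts as one transposition, and the shared "MAR" prefix lifts the Winkler score above the plain Jaro score.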
Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
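As a rough illustration of that cheaper approach, here is a sketch that groups names by a simplified American Soundex code (one common phonetic encoding); the sample names are only illustrative:

```python
from collections import defaultdict


def soundex(name: str) -> str:
    """Simplified American Soundex: first letter plus three digits.
    Adjacent letters with the same code collapse into one digit;
    vowels break a run of equal codes, while h and w do not."""
    codes = {}
    for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], 1):
        for letter in letters:
            codes[letter] = str(digit)
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        if ch not in "hw":  # h and w are transparent: they keep the previous code
            prev = code
    return result.ljust(4, "0")


# Group records by phonetic key, then run the expensive pairwise
# comparison only within each (much smaller) group.
names = ["Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft", "Tymczak"]
groups = defaultdict(list)
for n in names:
    groups[soundex(n)].append(n)
print(dict(groups))
# {'R163': ['Robert', 'Rupert'], 'R150': ['Rubin'],
#  'A261': ['Ashcraft', 'Ashcroft'], 'T522': ['Tymczak']}
```

Grouping first collapses the O(n²) all-pairs comparison into many small within-group comparisons; a fuzzy metric like Jaro-Winkler can then be applied only inside each group.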
There is a wealth of detailed information on these and other fuzzy string matching algorithms on the internet. This paper will give you a start:
A Comparison of Personal Name Matching: Techniques and Practical Issues
According to that paper, the speed of the four Jaro and Levenshtein algorithms I've mentioned, from fastest to slowest, is: Jaro, Jaro-Winkler, Levenshtein, Damerau-Levenshtein, with the slowest taking 2 to 3 times as long as the fastest. Of course these times depend on the lengths of the strings and on the implementations, and there are ways to optimize these algorithms that may not have been used.