High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

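import difflib

import Levenshtein  # third-party
import distance     # third-party

# Each row of titles.tsv is tab-separated; fields 3 and 4 (zero-based) hold the two titles.
with open("titles.tsv", "r", encoding="utf-8") as f:
    title_list = f.read().split("\n")[:-1]  # drop the empty entry after the final newline

for row in title_list:
    sr = row.lower().split("\t")

    diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
    lev = Levenshtein.ratio(sr[3], sr[4])
    sor = 1 - distance.sorensen(sr[3], sr[4])
    jac = 1 - distance.jaccard(sr[3], sr[4])

    # One space-separated line per title pair
    print(diffl, lev, sor, jac)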

I then plotted the results with R:

[Image: plot of Difflib vs. Levenshtein similarity]

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

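library(ggplot2)
library(GGally)

# Read the four space-separated similarity columns written by the Python script
similarity <- read.table("similarity_measures.txt", sep = " ")
colnames(similarity) <- c("difflib", "levenshtein", "sorensen", "jaccard")

# Plot matrix comparing every pair of measures
ggpairs(similarity)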

Result: [Image: ggpairs plot matrix of the four similarity measures]

The Difflib / Levenshtein similarity really is quite interesting.
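For readers unfamiliar with the other two measures, here is a minimal sketch of the textbook Jaccard and Sørensen (Dice) similarities, computed over the sets of characters in each string. The helper names are hypothetical, and treating a string as a set of characters is an assumption for illustration; the distance package used above may tokenize or handle edge cases differently.

```python
def jaccard_similarity(a, b):
    """|A & B| / |A | B| over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty strings count as identical
    return len(sa & sb) / len(sa | sb)


def sorensen_similarity(a, b):
    """2 * |A & B| / (|A| + |B|) over the sets of characters in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty strings count as identical
    return 2 * len(sa & sb) / (len(sa) + len(sb))


# "night" and "nacht" share the characters n, h, t (3 of 7 distinct characters).
jac = jaccard_similarity("night", "nacht")
sor = sorensen_similarity("night", "nacht")
print(jac, sor)
```

Both measures ignore character order and repetition, which is one reason they can disagree with Difflib and Levenshtein on the same pair of titles.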

2018 edit: If you're working on identifying similar strings, you could also check out minhashing--there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
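As a rough illustration of the idea, the sketch below builds MinHash signatures from character shingles and estimates Jaccard similarity as the fraction of matching signature slots. It is not taken from the intertext project; the function names, the shingle size of 3, the 128 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of slots where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


sig_a = minhash_signature(shingles("the quick brown fox"))
sig_b = minhash_signature(shingles("the quick brown fax"))
estimate = estimated_jaccard(sig_a, sig_b)
print(estimate)
```

Each signature is computed once per document, so comparing two documents afterwards costs only as much as comparing two fixed-length lists.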

difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes twice the number of matching characters divided by the total number of characters in the two strings.
Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits (insertions, deletions, or substitutions) needed to transform one string into the other.
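
The SequenceMatcher formula can be checked by hand with the standard library alone. In this sketch, ratio_by_hand is a hypothetical helper that sums the sizes of the matching blocks (M) and divides twice that by the combined length (T); it is meant only to show where the number comes from.

```python
import difflib


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # get_matching_blocks() yields (a, b, size) triples; the last one is a zero-size sentinel.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


a, b = "abcd", "bcde"
by_hand = ratio_by_hand(a, b)                       # M = 3 ("bcd"), T = 8
library = difflib.SequenceMatcher(None, a, b).ratio()
print(by_hand, library)  # both 0.75
```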

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from the difflib documentation)

Levenshtein is O(m*n), where n and m are the lengths of the two input strings.
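
The O(m*n) bound follows directly from the classic dynamic program, which fills one cell per pair of prefixes. Below is a minimal pure-Python sketch; edit_distance is a hypothetical helper for illustration and is not the Levenshtein module's C implementation.

```python
def edit_distance(s, t):
    """Levenshtein distance via a row-by-row dynamic program (O(m*n) time, O(n) space)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances from the empty prefix of s to each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # turning s[:i] into the empty string takes i deletions
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete s[i-1]
                curr[j - 1] + 1,     # insert t[j-1]
                prev[j - 1] + cost,  # substitute, or keep on a match
            )
        prev = curr
    return prev[n]


d = edit_distance("kitten", "sitting")
print(d)  # 3
```

The two nested loops visit each of the m*n cells exactly once, and each cell does constant work.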

Performance

According to the source code of the Levenshtein module: "Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster."
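
To check the speed claim on your own data, a small timing harness along these lines may help. It is only a sketch: time_ratios is a hypothetical helper, the sample strings are made up, the Levenshtein import is treated as optional, and the absolute numbers will vary by machine and input length.

```python
import difflib
import timeit

try:
    import Levenshtein  # optional third-party C extension
except ImportError:
    Levenshtein = None


def time_ratios(a, b, number=1000):
    """Return seconds taken for `number` ratio computations, keyed by library."""
    timings = {
        "difflib": timeit.timeit(
            lambda: difflib.SequenceMatcher(None, a, b).ratio(), number=number
        )
    }
    if Levenshtein is not None:
        timings["Levenshtein"] = timeit.timeit(
            lambda: Levenshtein.ratio(a, b), number=number
        )
    return timings


timings = time_ratios(
    "high performance fuzzy string comparison",
    "high-performance fuzzy string comparisons",
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```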

