String similarity metrics in Python

Tags: python, string, algorithm, levenshtein-distance

I want to measure the similarity between two strings. This page has examples of some similarity metrics. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints?

1. I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
2. False negatives are acceptable; false positives are not, except in extremely rare cases.
3. This is done in a non-real-time setting, so speed is not (much) of a concern.
4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

227

asked Sep 24 '09 11:09

agiliq

1 Answer

I realize it's not the same thing, but this is close enough:

    >>> import difflib
    >>> a = 'Hello, All you people'
    >>> b = 'hello, all You peopl'
    >>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    >>> seq.ratio()
    0.97560975609756095
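For context on where that number comes from: the `difflib` documentation defines `ratio()` as `2.0 * M / T`, where `M` is the number of matched elements and `T` is the combined length of both sequences. The sketch below recomputes it by hand from `get_matching_blocks()`. The variable names (`a_norm`, `b_norm`, `matches`, `total`, `manual_ratio`) are only illustrative and are not part of the original answer.

```python
import difflib

a = 'Hello, All you people'
b = 'hello, all You peopl'

# Normalise case first, as in the session above.
a_norm = a.lower()
b_norm = b.lower()

matcher = difflib.SequenceMatcher(a=a_norm, b=b_norm)

# M: total number of characters covered by the matching blocks.
# (The final block is a zero-size sentinel, so it adds nothing.)
matches = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both sequences.
total = len(a_norm) + len(b_norm)

manual_ratio = 2.0 * matches / total
print(matches, total, manual_ratio)
```

Here 20 of the 41 characters are matched, which gives the ratio of roughly 0.976 shown above.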

You can wrap the comparison in a function:

    def similar(seq1, seq2):
        return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

    >>> similar(a, b)
    True
    >>> similar('Hello, world', 'Hi, world')
    False
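Since the question tolerates false negatives but not false positives, one possible variation is to expose the cutoff as a parameter so it can be raised until spurious matches disappear. This is only a sketch: the `threshold` parameter and its default of 0.9 are assumptions for illustration, and a suitable value would have to be tuned against real data.

```python
import difflib


def similar(seq1, seq2, threshold=0.9):
    """Return True if the case-insensitive similarity ratio exceeds threshold.

    `threshold` is a hypothetical tuning knob: a higher value means
    fewer false positives at the cost of more false negatives.
    """
    ratio = difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio()
    return ratio > threshold


a = 'Hello, All you people'
b = 'hello, all You peopl'

print(similar(a, b))                          # default cutoff of 0.9
print(similar(a, b, threshold=0.98))          # stricter cutoff
print(similar('Hello, world', 'Hi, world'))   # clearly different strings
```

With the default cutoff the first pair matches; at 0.98 it no longer does, because its ratio is about 0.976.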

answered Sep 27 '22 22:09

Nadia Alramli
