
String similarity metrics in Python

I want to measure the similarity between two strings. This page has examples of some string similarity metrics. Python has an implementation of the Levenshtein algorithm. Is there a better algorithm (and hopefully a Python library) under these constraints?

  1. I want to do fuzzy matches between strings, e.g. matches('Hello, All you people', 'hello, all You peopl') should return True.
  2. False negatives are acceptable; false positives are not, except in extremely rare cases.
  3. This is done in a non-real-time setting, so speed is not much of a concern.
  4. [Edit] I am comparing multi-word strings.

Would something other than Levenshtein distance (or Levenshtein ratio) be a better algorithm for my case?

agiliq asked Sep 24 '09 11:09




1 Answer

I realize it's not the same thing, but this is close enough:

    >>> import difflib
    >>> a = 'Hello, All you people'
    >>> b = 'hello, all You peopl'
    >>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    >>> seq.ratio()
    0.97560975609756095
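To make that score less of a black box, here is a minimal sketch of where the number comes from. It relies on the documented definition of ratio() as 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. The variable names matched and manual_ratio are only illustrative.

```python
import difflib

a = 'Hello, All you people'.lower()
b = 'hello, all You peopl'.lower()

matcher = difflib.SequenceMatcher(a=a, b=b)

# get_matching_blocks() returns (i, j, size) triples; the final triple is a
# zero-size sentinel, so summing the sizes gives the matched character count.
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is defined as 2 * M / T, with T the combined length of both inputs.
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(matched, manual_ratio, matcher.ratio())
```

Here all 20 characters of the shorter string match, so the score is 40 / 41, which agrees with the value shown above.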

You can wrap this in a function:

    def similar(seq1, seq2):
        return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

    >>> similar(a, b)
    True
    >>> similar('Hello, world', 'Hi, world')
    False
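Since the question treats false positives as the costly error, it may help to make the cutoff adjustable instead of hard-coding 0.9. The following is a sketch only: the threshold parameter and the whitespace normalization are assumptions added here for illustration and are not part of the function above.

```python
import difflib


def similar(seq1, seq2, threshold=0.9):
    """Return True when the case-insensitive similarity ratio exceeds threshold."""
    # Assumption: collapse runs of whitespace so spacing differences in
    # multi-word strings do not lower the score.
    norm1 = ' '.join(seq1.lower().split())
    norm2 = ' '.join(seq2.lower().split())
    ratio = difflib.SequenceMatcher(None, norm1, norm2).ratio()
    return ratio > threshold


print(similar('Hello, All you people', 'hello, all You peopl'))
print(similar('Hello, world', 'Hi, world'))
print(similar('Hello, world', 'Hi, world', threshold=0.7))
```

Raising threshold trades more false negatives for fewer false positives, so a suitable value is probably best chosen by testing against a sample of your own strings.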
Nadia Alramli answered Sep 27 '22 22:09
