
String similarity in Python

I'm trying to compare strings in Python. The strings contain titles, which can be structured in a number of different ways:

'Title'
'Title: Subtitle'
'Title - Subtitle'
'Title, Subtitle'
'Title Subtitle'

Is it possible to do a similarity comparison in Python that would determine that match('Title: Subtitle', 'Title - Subtitle') returns True (or however it would be constructed)?

Basically I'm trying to determine if they're the same title even if the splitting is different.

if 'Title: Subtitle' == 'Title - Subtitle':
    match = 'True'
else:
    match = 'False'

Some titles might also be stored as 'The Title: The Subtitle' or 'Title, The: Subtitle, The'. That may add a bit of complexity, but I could probably get around it by reconstructing the string.

Midavalo asked Mar 27 '16 21:03



3 Answers

What you're trying to do is already covered by the third-party jellyfish package, which implements several string distance measures, including Levenshtein distance:

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
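An edit distance on its own is just a number, so to get the yes/no answer the question asks for, one option is to compare the distance against the length of the longer string. The sketch below is only an illustration: it uses a small pure-Python Levenshtein function so that it runs without installing anything (jellyfish.levenshtein_distance could be swapped in), and the names levenshtein and titles_match as well as the 0.25 threshold are assumptions, not part of jellyfish.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def titles_match(a: str, b: str, max_ratio: float = 0.25) -> bool:
    """Hypothetical helper: treat titles as equal when few edits separate them.

    max_ratio is an assumed threshold (edits divided by the longer length)
    and would need tuning against real data.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a.lower(), b.lower()) / longest <= max_ratio


print(titles_match('Title: Subtitle', 'Title - Subtitle'))
```

A relative threshold is used rather than a fixed number of edits so that long titles are allowed proportionally more differences than short ones.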
DevShark answered Oct 22 '22 03:10



The standard library's difflib module provides a function, get_close_matches, that does fuzzy string matching against a list of candidates:

>>> import difflib
>>> difflib.get_close_matches('python', ['snakes', 'thon.py', 'pythin'])
['pythin', 'thon.py']  # ordered by similarity score
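For comparing two specific titles rather than searching a list, difflib.SequenceMatcher gives a similarity ratio between 0 and 1 directly. The following is a rough sketch under some assumptions: the helper names normalize and is_same_title are made up for illustration, punctuation is assumed to carry no meaning in a title, and the 0.9 cutoff is a guess that would need tuning.

```python
import difflib
import re


def normalize(title: str) -> str:
    """Lowercase and collapse each run of punctuation or whitespace into one space."""
    return re.sub(r"[\W_]+", " ", title.lower()).strip()


def is_same_title(a: str, b: str, cutoff: float = 0.9) -> bool:
    """Hypothetical helper: True when the normalized titles are similar enough.

    cutoff is an assumed threshold on SequenceMatcher.ratio().
    """
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= cutoff


print(is_same_title('Title: Subtitle', 'Title - Subtitle'))
```

Normalizing first means that 'Title: Subtitle', 'Title - Subtitle' and 'Title, Subtitle' all reduce to the same string, so the fuzzy ratio only has to absorb real differences such as typos.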
Todd Owen answered Oct 22 '22 03:10



You can use the in keyword. It isn't a similarity comparison, but it checks whether the title parts appear in the string:

s = "Title: Subtitle"

if "Title" in s or "Subtitle" in s:
    match = 'True'
else:
    match = 'False'
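A stricter variant of the same membership idea is to require that both strings contain exactly the same words, ignoring punctuation, case and order. This is only a sketch: the names words and same_words are hypothetical, and it assumes that word order and repeated words do not matter, which also lets it handle forms like 'Title, The: Subtitle, The'.

```python
import re


def words(title: str) -> frozenset:
    """Return the set of lowercase words in a title, dropping punctuation."""
    return frozenset(re.findall(r"\w+", title.lower()))


def same_words(a: str, b: str) -> bool:
    """Hypothetical helper: True when both titles use the same set of words."""
    return words(a) == words(b)


print(same_words('The Title: The Subtitle', 'Title, The: Subtitle, The'))
```

Because sets discard order and duplicates, two different titles built from the same words would also compare equal, so this is best suited to data where that is unlikely.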
xdola answered Oct 22 '22 03:10
