
What is a good strategy to group similar words?

Say I have a list of movie names with misspellings and small variations, like this:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates of the carribean"
 "Pirates of the Caribbean: Dead Man's Chest"
 "Pirates of the Caribbean trilogy"
 "Pirates of the Caribbean"
 "Pirates Of The Carribean"

How do I group or find such sets of similar names, preferably using Python and/or Redis?

abc def foo bar asked Jul 05 '11 07:07



1 Answer

Have a look at "fuzzy matching". The thread linked below lists some great tools that calculate the similarity between strings.

I'm especially fond of the difflib module in Python's standard library:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

Fredrik Pihl answered Sep 28 '22 02:09
