Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get close matches for multiple words in a dictionary

I have a dictionary with the following structure:

{
    1: {"names": ["name1_A", "name1_B", ...]},
    2: {"names": ["name2_A", "name2_B", ...]},
    ...
}

where name1_A and name1_B are synonyms/aliases/different ways to write the same name, whose ID is 1. name2_A and name2_B are aliases of the same name, whose ID is 2, and so on.

I need to write a function which takes a user input and returns the ID of the name whose alias is most similar to the user input.

I know it's not very intuitive to understand what I mean, so here's an example. Let's say this is my dictionary:

{
    1: {"names": ["James", "Jamie"]},
    2: {"names": ["Karen", "Karyn"]}
}

The user types in the word Jimmy. Since the closest match to Jimmy from the dictionary is Jamie, the function has to return the ID 1.

If the user types in the world Karena, since the closest match is Karen, the function has to return the ID 2.

I think the best way to get the closest math is to use difflib's get_close_matches(). However, that function takes a list of possibilities as argument, and I cannot think of a way to correctly use it in my function. Any help would be appreciated.

like image 353
user2747949 Avatar asked Jun 29 '17 21:06

user2747949


1 Answers

If you're interested in 3rd party modules, there's a nice little module I like to use for this sort of thing called fuzzywuzzy, for fuzzy string matching in Python. This module uses the Levenshtein Distance metric for computing distances between two strings. Here's an example of how you use it:

>>> from fuzzywuzzy import fuzz
>>> from functools import partial
>>> data_dict = {
...     1: {"names": ["James", "Jamie"]},
...     2: {"names": ["Karen", "Karyn"]}
... }
>>> input_str = 'Karena'
>>> f = partial(fuzz.partial_ratio, input_str)
>>> matches = { k : max(data_dict[k]['names'], key=f) for k in data_dict}
>>> matches
{1: 'James', 2: 'Karen'}
>>> { i : (matches[i], f(matches[i])) for i in matches }
{1: ('James', 40), 2: ('Karen', 100)}

Now, you can extract Karen since it has the highest score.

I've had to call the function twice for the purpose of this demo, but you should be able to do that just once depending on how you expand upon this example.

Another thing to note is that fuzz.partial_ratio is more lenient with its matches. For a stricter matching scheme, consider using fuzz.ratio.

You can peruse more examples using fuzzy string matching here.

like image 151
cs95 Avatar answered Sep 28 '22 03:09

cs95