Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Given two python lists of same length. How to return the best matches of similar values?

Given are two python lists with strings in them (names of persons):

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

I want a mapping of the names, that are most similar.

'J. Payne'           -> 'John Payne'
'George Bush'        -> 'George W. Bush'
'Billy Idol'         -> 'Billy Idol'
'M Stuart'           -> 'M. Stuart'
'Luc van den Bergen' -> 'Luc Bergen'

Is there a neat way to do this in python? The lists contain in average 5 or 6 Names. Sometimes more, but this is seldom. Sometimes it is just one name in every list, which could be spelled slightly different.

like image 486
Aufwind Avatar asked Aug 15 '11 06:08

Aufwind


People also ask

How do you compare two lists in identical Python?

A straightforward way to check the equality of the two lists in Python is by using the equality == operator. When the equality == is used on the list type in Python, it returns True if the lists are equal and False if they are not.

How do I compare two list values in Python?

Using list.sort() method sorts the two lists and the == operator compares the two lists item by item which means they have equal data items at equal positions. This checks if the list contains equal data item values but it does not take into account the order of elements in the list.


3 Answers

Using the function defined here: http://hetland.org/coding/python/levenshtein.py

>>> for i in list_1:
...     print i, '==>', min(list_2, key=lambda j:levenshtein(i,j))
... 
J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen

You could use functools.partial instead of the lambda

>>> from functools import partial
>>> for i in list_1:
...     print i, '==>', min(list_2, key=partial(levenshtein,i))
...
J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen
like image 166
John La Rooy Avatar answered Sep 19 '22 17:09

John La Rooy


You might try difflib:

import difflib

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

mymap = {}
for elem in list_1:
    closest = difflib.get_close_matches(elem, list_2)
    if closest:
        mymap[elem] = closest[0]

print mymap

output:

{'George Bush': 'George W. Bush', 
 'Luc van den Bergen': 'Luc Bergen', 
 'Billy Idol': 'Billy Idol', 
 'J. Payne': 'John Payne', 
 'M Stuart': 'M. Stuart'}
like image 20
Johannes Charra Avatar answered Sep 19 '22 17:09

Johannes Charra


Here is a variant of the given solutions that also optimizes the global minimum distance. It uses the Munkres assignment algorithm to ensure that the string pairings are optimal.

from munkres import Munkres
def match_lists(l1, l2):
    # Compute a matrix of string distances for all combinations of
    # items in l1 and l2.
    matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1]

    # Now figure out what the global minimum distance between the
    # pairs is.
    indexes = Munkres().compute(matrix)
    for row, col in indexes:
        yield l1[row], l2[col]

l1 = [
    'bolton',
    'manchester city',
    'manchester united',
    'wolves',
    'liverpool',
    'sunderland',
    'wigan',
    'norwich',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle utd',
    'stoke city',
    'everton',
    'tottenham',
    'blackburn',
    'west brom',
    'qpr',
    'swansea'
    ]
l2 = [
    'bolton wanderers',
    'manchester city',
    'manchester united',
    'wolverhampton',
    'liverpool',
    'norwich city',
    'sunderland',
    'wigan athletic',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle united',
    'stoke city',
    'everton',
    'tottenham hotspur',
    'blackburn rovers',
    'west bromwich',
    'queens park rangers',
    'swansea city'
    ]
for i1, i2 in match_lists(l1, l2):
    print i1, '=>', i2

For the lists given, where the differences more stems from alternative spellings and nicknames rather than spelling errors, this method gives better results than just using levenshtein or difflib. The munkres module can be found here: http://software.clapper.org/munkres/

like image 28
Björn Lindqvist Avatar answered Sep 19 '22 17:09

Björn Lindqvist