Given are two python lists with strings in them (names of persons):
list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']
I want a mapping of the names, that are most similar.
'J. Payne' -> 'John Payne'
'George Bush' -> 'George W. Bush'
'Billy Idol' -> 'Billy Idol'
'M Stuart' -> 'M. Stuart'
'Luc van den Bergen' -> 'Luc Bergen'
Is there a neat way to do this in python? The lists contain in average 5 or 6 Names. Sometimes more, but this is seldom. Sometimes it is just one name in every list, which could be spelled slightly different.
A straightforward way to check the equality of the two lists in Python is by using the equality == operator. When the equality == is used on the list type in Python, it returns True if the lists are equal and False if they are not.
Using list.sort() method sorts the two lists and the == operator compares the two lists item by item which means they have equal data items at equal positions. This checks if the list contains equal data item values but it does not take into account the order of elements in the list.
Using the function defined here: http://hetland.org/coding/python/levenshtein.py
>>> for i in list_1:
... print i, '==>', min(list_2, key=lambda j:levenshtein(i,j))
...
J. Payne ==> John Payne George Bush ==> George W. Bush Billy Idol ==> Billy Idol M Stuart ==> M. Stuart Luc van den Bergen ==> Luc Bergen
You could use functools.partial instead of the lambda
>>> from functools import partial
>>> for i in list_1:
... print i, '==>', min(list_2, key=partial(levenshtein,i))
...
J. Payne ==> John Payne George Bush ==> George W. Bush Billy Idol ==> Billy Idol M Stuart ==> M. Stuart Luc van den Bergen ==> Luc Bergen
You might try difflib
:
import difflib
list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']
mymap = {}
for elem in list_1:
closest = difflib.get_close_matches(elem, list_2)
if closest:
mymap[elem] = closest[0]
print mymap
output:
{'George Bush': 'George W. Bush',
'Luc van den Bergen': 'Luc Bergen',
'Billy Idol': 'Billy Idol',
'J. Payne': 'John Payne',
'M Stuart': 'M. Stuart'}
Here is a variant of the given solutions that also optimizes the global minimum distance. It uses the Munkres assignment algorithm to ensure that the string pairings are optimal.
from munkres import Munkres
def match_lists(l1, l2):
# Compute a matrix of string distances for all combinations of
# items in l1 and l2.
matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1]
# Now figure out what the global minimum distance between the
# pairs is.
indexes = Munkres().compute(matrix)
for row, col in indexes:
yield l1[row], l2[col]
l1 = [
'bolton',
'manchester city',
'manchester united',
'wolves',
'liverpool',
'sunderland',
'wigan',
'norwich',
'arsenal',
'aston villa',
'chelsea',
'fulham',
'newcastle utd',
'stoke city',
'everton',
'tottenham',
'blackburn',
'west brom',
'qpr',
'swansea'
]
l2 = [
'bolton wanderers',
'manchester city',
'manchester united',
'wolverhampton',
'liverpool',
'norwich city',
'sunderland',
'wigan athletic',
'arsenal',
'aston villa',
'chelsea',
'fulham',
'newcastle united',
'stoke city',
'everton',
'tottenham hotspur',
'blackburn rovers',
'west bromwich',
'queens park rangers',
'swansea city'
]
for i1, i2 in match_lists(l1, l2):
print i1, '=>', i2
For the lists given, where the differences more stems from alternative spellings and nicknames rather than spelling errors, this method gives better results than just using levenshtein or difflib. The munkres module can be found here: http://software.clapper.org/munkres/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With