Given two Python lists of the same length, how do I return the best matches of similar values?

Tags:

python

string

list

mapping

Given are two Python lists of strings (names of persons):

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

I want a mapping from each name in the first list to the most similar name in the second list:

'J. Payne'           -> 'John Payne'
'George Bush'        -> 'George W. Bush'
'Billy Idol'         -> 'Billy Idol'
'M Stuart'           -> 'M. Stuart'
'Luc van den Bergen' -> 'Luc Bergen'

Is there a neat way to do this in Python? The lists contain 5 or 6 names on average. Sometimes there are more, but this is rare. Sometimes there is just one name in each list, and the two could be spelled slightly differently.

486

asked Aug 15 '11 06:08

Aufwind

3 Answers

Using the function defined here: http://hetland.org/coding/python/levenshtein.py

>>> for i in list_1:
...     print(i, '==>', min(list_2, key=lambda j: levenshtein(i, j)))
...

J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen

You could also use functools.partial instead of the lambda:

>>> from functools import partial
>>> for i in list_1:
...     print(i, '==>', min(list_2, key=partial(levenshtein, i)))
...

J. Payne ==> John Payne
George Bush ==> George W. Bush
Billy Idol ==> Billy Idol
M Stuart ==> M. Stuart
Luc van den Bergen ==> Luc Bergen
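
In case the linked module is not at hand, here is a minimal sketch of an edit-distance function that could stand in for it. It is a standard dynamic-programming implementation written for this example, not a copy of the code at the URL above. The name levenshtein is kept only so that the loops above work unchanged, and mapping is a hypothetical variable name.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    # Keep b as the shorter string so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

# Map every name in list_1 to the name in list_2 with the smallest distance.
mapping = {
    name: min(list_2, key=lambda other: levenshtein(name, other))
    for name in list_1
}

for name, match in mapping.items():
    print(name, '==>', match)
```

Note that each name is matched independently here, so two names from list_1 could in principle end up mapped to the same name in list_2.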

166

answered Sep 19 '22 17:09

John La Rooy

You might try difflib:

import difflib

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

mymap = {}
for elem in list_1:
    closest = difflib.get_close_matches(elem, list_2)
    if closest:
        mymap[elem] = closest[0]

print(mymap)

Output (reformatted for readability):

{'J. Payne': 'John Payne',
 'George Bush': 'George W. Bush',
 'Billy Idol': 'Billy Idol',
 'M Stuart': 'M. Stuart',
 'Luc van den Bergen': 'Luc Bergen'}
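
The "if closest:" check above matters because get_close_matches returns an empty list when nothing scores at or above its cutoff (0.6 by default). A small sketch of how the n and cutoff parameters might be used to make that explicit follows; the helper closest_or_none and the extra name 'Zzzzzz' are made up for this illustration.

```python
import difflib

list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']


def closest_or_none(name, candidates, cutoff=0.6):
    """Return the best match for name, or None if nothing is similar enough.

    Hypothetical helper: n=1 asks difflib for the single best candidate,
    and cutoff is the minimum similarity ratio (between 0 and 1).
    """
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# 'Zzzzzz' shares no characters with any candidate, so it maps to None.
mapping = {name: closest_or_none(name, list_2) for name in list_1 + ['Zzzzzz']}

# A stricter cutoff rejects looser matches such as 'J. Payne' / 'John Payne'.
strict = {name: closest_or_none(name, list_2, cutoff=0.9) for name in list_1}

print(mapping)
print(strict)
```

Raising the cutoff trades recall for precision: abbreviated names drop out, while near-identical spellings still match.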

answered Sep 19 '22 17:09

Johannes Charra

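Here is a variant of the given solutions that also minimizes the total distance over all pairs. It uses the Munkres assignment algorithm (also known as the Hungarian algorithm) so that each item in one list is paired with exactly one item in the other and the pairing as a whole is optimal.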

from munkres import Munkres

# levenshtein() is the edit-distance function linked in the first answer
# (http://hetland.org/coding/python/levenshtein.py).

def match_lists(l1, l2):
    # Compute a matrix of string distances for all combinations of
    # items in l1 and l2.
    matrix = [[levenshtein(i1, i2) for i2 in l2] for i1 in l1]

    # Find the assignment of rows to columns that minimizes the
    # total distance over all pairs.
    indexes = Munkres().compute(matrix)
    for row, col in indexes:
        yield l1[row], l2[col]

l1 = [
    'bolton',
    'manchester city',
    'manchester united',
    'wolves',
    'liverpool',
    'sunderland',
    'wigan',
    'norwich',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle utd',
    'stoke city',
    'everton',
    'tottenham',
    'blackburn',
    'west brom',
    'qpr',
    'swansea'
    ]
l2 = [
    'bolton wanderers',
    'manchester city',
    'manchester united',
    'wolverhampton',
    'liverpool',
    'norwich city',
    'sunderland',
    'wigan athletic',
    'arsenal',
    'aston villa',
    'chelsea',
    'fulham',
    'newcastle united',
    'stoke city',
    'everton',
    'tottenham hotspur',
    'blackburn rovers',
    'west bromwich',
    'queens park rangers',
    'swansea city'
    ]
for i1, i2 in match_lists(l1, l2):
    print(i1, '=>', i2)

For the lists given, where the differences stem more from alternative spellings and nicknames than from spelling errors, this method gives better results than just using levenshtein or difflib on each item separately. The munkres module can be found here: http://software.clapper.org/munkres/
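
For lists as short as the 5 or 6 names in the question, the same idea of a globally optimal one-to-one pairing can be sketched with the standard library alone by trying every permutation. This is a brute-force stand-in and not the Munkres algorithm; it becomes impractical beyond roughly ten items because the number of permutations grows factorially. The names dissimilarity and match_lists_bruteforce are made up for this example, and difflib's similarity ratio is assumed as the distance measure in place of levenshtein.

```python
from difflib import SequenceMatcher
from itertools import permutations


def dissimilarity(a, b):
    """Return a cost in [0, 1]: 0 for identical strings, 1 for nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def match_lists_bruteforce(l1, l2):
    """Pair every item of l1 with a distinct item of l2 at minimal total cost.

    Assumes both lists have the same length, as in the question.
    """
    if len(l1) != len(l2):
        raise ValueError('both lists must have the same length')

    # cost[row][col] is the cost of pairing l1[row] with l2[col].
    cost = [[dissimilarity(a, b) for b in l2] for a in l1]

    # Each permutation assigns column perm[row] to every row exactly once.
    best = min(
        permutations(range(len(l2))),
        key=lambda perm: sum(cost[row][col] for row, col in enumerate(perm)),
    )
    return [(l1[row], l2[col]) for row, col in enumerate(best)]


list_1 = ['J. Payne', 'George Bush', 'Billy Idol', 'M Stuart', 'Luc van den Bergen']
list_2 = ['John Payne', 'George W. Bush', 'Billy Idol', 'M. Stuart', 'Luc Bergen']

pairs = match_lists_bruteforce(list_1, list_2)
for left, right in pairs:
    print(left, '=>', right)
```

Unlike matching each name independently, this can never pair two different names with the same target.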

answered Sep 19 '22 17:09

Björn Lindqvist

