Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to map the most "similar" strings from one list to another in python?

Given are two lists containing strings.

  1. One contains the name of organisations (mostly universitys) all around the world - not only written in english but always using latin alphabet.

  2. The other list contains mostly full addresses in which strings (organisations) from the first list may occur.

An Example:

addresses = [
             "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
             "Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
             "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
             "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",    
             "Computer Science Department, University of California, Santa Barbara, USA 93106",
             "Fraunhofer IAIS, Sankt Augustin, Germany",
             "Department of Computer Science, Cornell University, Ithaca, NY",
             "University of Wisconsin-Madison"
            ]

organisations = [
                 "Catholic University of Leuven"
                 "Fraunhofer IAIS"
                 "Cornell University of Ithaca"
                 "Tübingener Max Plank Institut"
                ]

As you can see the desired mapping would be:

"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of  Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS 
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --

My thinking was to use some kind of "disctance- algorithm" to calculate the similarity of the strings. Since I cannot just look for an organisation in an address just by doing if address in organisation because it could be written slightly differently at in different places. So my first guess was using the difflib module. Especially the difflib.get_close_matches() function for selecting for every address the closest string from the organisations list. But I am not quite confident, that the results will be accurate enough. Although I don't know how high I should set the ratio which seams to be a similarity measure.

Before spending too much time in trying the difflib module I thought of asking the more experienced people here, if this is the right approach or if there is a more suited tool / way to solve my problem. Thanks!

PS: I don't need an optimal solution.

like image 522
Aufwind Avatar asked Dec 08 '11 14:12

Aufwind


1 Answers

Use the following as your string distance function (instead of plain levenshtein distance):

def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)

    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words 

Then use the Munkres assignment algorithm shown here since there appears to be a 1:1 mapping between organisations and adresses.

like image 66
Björn Lindqvist Avatar answered Nov 15 '22 00:11

Björn Lindqvist