Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Figure out if a business name is very similar to another one - Python

I'm working with a large database of businesses.

I'd like to be able to compare two business names for similarity to see if they possibly might be duplicates.

Below is a list of business names that should test as having a high probability of being duplicates, what is a good way to go about this?

 George Washington Middle Schl George Washington School  Santa Fe East Inc Santa Fe East  Chop't Creative Salad Co Chop't Creative Salad Company  Manny and Olga's Pizza Manny's & Olga's Pizza  Ray's Hell Burger Too Ray's Hell Burgers  El Sol El Sol de America  Olney Theatre Center for the Arts Olney Theatre  21 M Lounge 21M Lounge  Holiday Inn Hotel Washington Holiday Inn Washington-Georgetown  Residence Inn Washington,DC/Dupont Circle Residence Inn Marriott Dupont Circle  Jimmy John's Gourmet Sandwiches Jimmy John's  Omni Shoreham Hotel at Washington D.C. Omni Shoreham Hotel 
like image 952
Chris Dutrow Avatar asked Jun 19 '11 03:06

Chris Dutrow


People also ask

How do you check if two names are the same in Python?

Python comparison operators This is done using the following operators: == : This checks whether two strings are equal. != : This checks if two strings are not equal.


2 Answers

I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.

First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.

Parsing consists of applying various cleaning techniques. Some examples:

  • Standardizing lettercase (e.g., all lowercase)
  • Standardizing punctuation (e.g., commas must be followed by spaces)
  • Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
  • Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
  • Standardizing legal control terms (e.g., converting "Co." to "Company")

In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list.

Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.

In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.

Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.

I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was a manual checking planned, anyway.

like image 134
Michael J. Barber Avatar answered Sep 21 '22 00:09

Michael J. Barber


I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.

Parsing

Let's use this odd name as an example.

name = "THE |  big,- Pharma: LLC"  # example of a company name 

We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:

from cleanco import cleanco name = cleanco(name).clean_name()  # 'THE | big,- Pharma' 

Remove all punctuation:

name = name.translate(None, string.punctuation)  # 'THE  big Pharma' 

(for unicode strings, the following code works instead (source, regex):

import regex name = regex.sub(ur"[[:punct:]]+", "", name)  # u'THE  big Pharma' 

Split the name into tokens using NLTK:

import nltk tokens = nltk.word_tokenize(name)  # ['THE', 'big', 'Pharma'] 

Lowercase all tokens:

tokens = [t.lower() for t in tokens]  # ['the', 'big', 'pharma'] 

Remove stop words. Note that it might cause problems with companies like On Mars will be incorrectly matched to Mars, because On is a stopword.

from nltk.corpus import stopwords tokens = [t for t in tokens if t not in stopwords.words('english')]  # ['big', 'pharma'] 

I don't cover accented and special characters here (improvements welcome).

Matching

Now, when we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenstein for this task, but is still not good enough. The reason is that it does not take into account the importance of words in the name (like TF-IDF does). So common words like "Company" influence the score just as much as words that might uniquely identify company name.

To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:

# token2frequency is just a word counter of all words in all names # in the dataset def sequence_uniqueness(seq, token2frequency):     return sum(1/token2frequency(t)**0.5 for t in seq)  def name_similarity(a, b, token2frequency):     a_tokens = set(a.split())     b_tokens = set(b.split())     a_uniq = sequence_uniqueness(a_tokens)     b_uniq = sequence_uniqueness(b_tokens)     return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5 

Using that, you can match names with similarity exceeding certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard and Jaro-Winkler) and train a binary classification model using some labeled data, which will, given a number of scores, output if the candidate pair is a match or not. More on this can be found in the same blog post.

like image 36
Dennis Golomazov Avatar answered Sep 23 '22 00:09

Dennis Golomazov