I have to cross-validate some data based on names.
The problem I'm facing is that, depending on the source, the names have slight variations, for example:
L & L AIR CONDITIONING vs L & L AIR CONDITIONING Service
BEST ROOFING vs ROOFING INC
I have several thousand records, so doing this manually would be very time-consuming; I want to automate the process as much as possible.
Since there are additional words, it isn't enough to just lowercase the names.
What are good algorithms to handle this?
Maybe calculate a similarity score that gives low weight to words like 'INC' or 'Service'?
Edit:
I tried the difflib library:
difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()
I'm getting decent results with it.
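For illustration only, here is a rough sketch of how that ratio could be applied across two lists of records with a cut-off; the helper name and the 0.8 threshold are my own placeholders, not something from the original data.
import difflib

# Illustrative sketch: collect name pairs whose SequenceMatcher ratio clears
# an arbitrary threshold (0.8 here), so only the borderline cases are left
# for manual review.
def find_candidate_matches(names_a, names_b, threshold=0.8):
    matches = []
    for name_1 in names_a:
        for name_2 in names_b:
            ratio = difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()
            if ratio >= threshold:
                matches.append((name_1, name_2, ratio))
    return matches

find_candidate_matches(['L & L AIR CONDITIONING'], ['L & L AIR CONDITIONING Service'])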
I would use cosine similarity to achieve this. It gives you a score for how closely the two strings match.
Here is the code to help you with that (I remember getting this code from Stack Overflow itself some months ago, but I couldn't find the link now):
import re
import math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Cosine similarity between two word-count vectors.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Split the text into words and count occurrences.
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)
get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service') # returns 0.9258200997725514
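To address the question's idea of down-weighting words like 'INC' or 'Service', one simple variant (my own sketch, not part of the original code, and reusing WORD, Counter and get_cosine from above) is to drop such generic tokens before building the vectors; the GENERIC word set is an assumption you would adapt to your data.
# Illustrative variant: ignore generic company words before vectorising, so
# pairs like 'BEST ROOFING' vs 'ROOFING INC' are compared on the meaningful
# tokens only. Reuses WORD, Counter and get_cosine defined above.
GENERIC = {'inc', 'llc', 'co', 'corp', 'service', 'services'}

def text_to_vector_filtered(text):
    words = [w for w in WORD.findall(text) if w not in GENERIC]
    return Counter(words)

def get_similarity_filtered(a, b):
    a = text_to_vector_filtered(a.strip().lower())
    b = text_to_vector_filtered(b.strip().lower())
    return get_cosine(a, b)

get_similarity_filtered('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service')  # 1.0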
Another version that I found useful is slightly NLP-based; I authored it myself.
import re
import math
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn

stop = stopwords.words('english')
WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    # Cosine similarity between two word-count vectors.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    # Expand each word with its WordNet lemma names, drop stopwords, and stem.
    words = WORD.findall(text)
    a = []
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    for i in words:
        if i not in a:
            a.append(i)
    a = set(a)
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)

def get_char_wise_similarity(a, b):
    # Average the pairwise similarity of every token in a against every token in b.
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    try:
        return sum(s) / float(len(s))
    except ZeroDivisionError:  # len(s) == 0
        return 0
get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068
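One setup note not mentioned in the original answer: the NLTK corpora used here have to be downloaded once before the code will run.
import nltk
nltk.download('stopwords')  # needed for nltk.corpus.stopwords
nltk.download('wordnet')    # needed for nltk.corpus.wordnet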
You can call either get_similarity or get_char_wise_similarity to see which works better for your use case. I used both: normal similarity to weed out the really close ones, then character-wise similarity to weed out the close-enough ones. The remaining ones had to be dealt with manually; a sketch of that two-stage filtering follows below.
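As an illustration only, here is a rough sketch of that two-stage filtering, assuming the get_similarity and get_char_wise_similarity functions defined above; the 0.8 and 0.5 thresholds are arbitrary placeholders to tune on your own data.
# Rough sketch of the two-stage filtering described above. Assumes
# get_similarity and get_char_wise_similarity from the snippet above.
def match_names(names_a, names_b, high=0.8, low=0.5):
    auto_matches, needs_review = [], []
    for a in names_a:
        for b in names_b:
            if get_similarity(a, b) >= high:
                auto_matches.append((a, b))          # treat as the same entity
            elif get_char_wise_similarity(a, b) >= low:
                needs_review.append((a, b))          # close enough: check manually
    return auto_matches, needs_review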
You might be able to just use the Levenshtein distance, which is a good way to measure the difference between two strings.
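For reference, here is a minimal pure-Python implementation of that distance (my own sketch, not part of the original answer); in practice a library such as python-Levenshtein or rapidfuzz would be faster.
# Minimal dynamic-programming Levenshtein distance (illustrative sketch).
def levenshtein(s, t):
    # previous[j] holds the edit distance between the processed prefix of s
    # and t[:j]; each pass over s builds the next row.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

levenshtein('l & l air conditioning', 'l & l air conditioning service')  # 8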