I have a dataframe with a column A as below:
Column A
Carrefour supermarket
Carrefour hypermarket
Carrefour
carrefour
Carrfour downtown
Carrfor market
Lulu
Lulu Hyper
Lulu dxb
lulu airport
k.m trading
KM Trading
KM trade
K.M. Trading
KM.Trading
I want to derive the below "Column A" from it:
Column A
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Lulu
Lulu
Lulu
Lulu
KM Trading
KM Trading
KM Trading
KM Trading
KM Trading
To do this, I coded as below:
import re

MERCHANT_NAME_DICT = {"lulu": "Lulu", "carrefour": "Carrefour", "km": "KM Trading"}

def replace_merchant_name(row):
    """Given a long merchant name, replace it with the short name."""
    # drop whitespace and dots, lowercase, then look for a known key
    processed_row = re.sub(r'\s+|\.', '', row.lower()).strip()
    for key, value in MERCHANT_NAME_DICT.items():
        if key in processed_row:
            return value
    return row

frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].astype(str)
frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].apply(replace_merchant_name)
But I want to use NLP logic and make it a generic function (instead of hard-coding the mapping values): just call the generic function on any similar data column and get the desired results. I am pretty new to NLP concepts, so I am looking for some help with it, friends.
NOTE: Basically, I want a generic NLP way of finding all similar words in a given column (or in a list).
If you don't have a golden set of "right" merchant names, this sounds like a clustering problem. It can be solved with a clever distance function (like Jaro-Winkler from Jindrich's answer) and a simple clustering algorithm (e.g. agglomerative clustering).
After clustering the texts, you can find the most representative text of each cluster and replace the whole cluster with it.
import numpy as np
import re
import textdistance
# we will need scikit-learn>=0.21
from sklearn.cluster import AgglomerativeClustering

texts = [
    'Carrefour supermarket', 'Carrefour hypermarket', 'Carrefour', 'carrefour', 'Carrfour downtown', 'Carrfor market',
    'Lulu', 'Lulu Hyper', 'Lulu dxb', 'lulu airport',
    'k.m trading', 'KM Trading', 'KM trade', 'K.M. Trading', 'KM.Trading'
]

def normalize(text):
    """Keep only lower-cased text and numbers."""
    return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4):
    """Replace each text with the representative of its cluster."""
    normalized_texts = np.array([normalize(text) for text in texts])
    # pairwise distance matrix: 1 - Jaro-Winkler similarity
    distances = 1 - np.array([
        [textdistance.jaro_winkler(one, another) for one in normalized_texts]
        for another in normalized_texts
    ])
    clustering = AgglomerativeClustering(
        distance_threshold=threshold,  # this parameter needs to be tuned carefully
        affinity="precomputed", linkage="complete", n_clusters=None
    ).fit(distances)
    # for each cluster, pick the text closest to all others as its representative
    centers = dict()
    for cluster_id in set(clustering.labels_):
        index = clustering.labels_ == cluster_id
        centrality = distances[:, index][index].sum(axis=1)
        centers[cluster_id] = normalized_texts[index][centrality.argmin()]
    return [centers[i] for i in clustering.labels_]

print(group_texts(texts))
The code above prints the following output:
['carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour',
'lulu', 'lulu', 'lulu', 'lulu',
'km trading', 'km trading', 'km trading', 'km trading', 'km trading']
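To plug this back into a dataframe like the one in the question, you can apply the function to the whole column at once. A minimal sketch, assuming the code above has been run and the column is named MERCHANT_NAME as in the question:

import pandas as pd

frame = pd.DataFrame({'MERCHANT_NAME': texts})
# replace every raw name with the representative of its cluster
frame['MERCHANT_NAME'] = group_texts(frame['MERCHANT_NAME'].astype(str).tolist())
print(frame['MERCHANT_NAME'].unique())  # expected: ['carrefour' 'lulu' 'km trading']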
As a baseline, this function will do. You may want to improve it by modifying the distance function so that it reflects your domain more closely; one hypothetical tweak is sketched below. Unfortunately, most such adjustments are domain-specific, so you will have to tune them on your own dataset.
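For example, if you believe the brand token tends to come first in a merchant name, you could weight it more heavily (the 0.7/0.3 weights below are illustrative guesses, not tuned values):

import textdistance

def merchant_distance(one, another):
    """Blend full-string similarity with first-token (brand) similarity."""
    first = lambda s: s.split()[0] if s.split() else s
    brand_sim = textdistance.jaro_winkler(first(one), first(another))
    full_sim = textdistance.jaro_winkler(one, another)
    return 1 - (0.7 * brand_sim + 0.3 * full_sim)

Since merchant_distance already returns a distance, you would build the matrix in group_texts from merchant_distance(one, another) directly, without the 1 - wrapper.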
You can do something like this: for each word that is not in a spellchecking vocabulary (and thus is likely to be misspelled), look through your list of merchant names and check whether there is a name within a small edit distance. You can also normalize the words before the similarity search, i.e., lowercase everything and remove punctuation.
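A minimal sketch of that idea, using difflib from the standard library as a stand-in for a real spellchecker; the vocabulary and merchant list below are illustrative assumptions:

import difflib

VOCABULARY = {"supermarket", "hypermarket", "market", "trading", "airport", "downtown"}
MERCHANT_NAMES = ["lulu", "carrefour", "km"]

def correct_word(word):
    """Map a likely misspelled word to the closest merchant name, if any."""
    word = word.lower().strip(".")
    if word in VOCABULARY or word in MERCHANT_NAMES:
        return word
    close = difflib.get_close_matches(word, MERCHANT_NAMES, n=1, cutoff=0.8)
    return close[0] if close else word

print(correct_word("Carrfour"))  # -> 'carrefour'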
You can use the textdistance package, which implements plenty of string distances. I would probably use the Jaro-Winkler distance for this purpose.
import string
import textdistance

MERCHANT_NAMES = [("lulu", "Lulu"), ("carrefour", "Carrefour"), ("km", "KM")]
DISTANCE_THRESHOLD = 0.9

def normalize(orig_name):
    # strip punctuation and lowercase before measuring similarity
    name_sig = orig_name.translate(str.maketrans('', '', string.punctuation)).lower()
    best_score = DISTANCE_THRESHOLD
    replacement = orig_name  # fall back to the original name if nothing scores higher
    for sig, name in MERCHANT_NAMES:
        score = textdistance.jaro_winkler(name_sig, sig)
        if score > best_score:
            best_score = score
            replacement = name
    return replacement
You will probably need to tune the acceptable threshold for word replacement and do something about multi-word expressions, e.g., throw away words similar to "supermarket", "hypermarket", etc.
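One simple option for the multi-word case, sketched under the assumption of a hand-picked list of generic words (extend it for your own data):

import string
import textdistance

GENERIC_WORDS = ["supermarket", "hypermarket", "market", "airport", "downtown"]

def strip_generic_words(orig_name, threshold=0.9):
    """Drop tokens that closely match a known generic word, keeping the brand part."""
    cleaned = orig_name.translate(str.maketrans('', '', string.punctuation)).lower()
    kept = [word for word in cleaned.split()
            if all(textdistance.jaro_winkler(word, generic) <= threshold
                   for generic in GENERIC_WORDS)]
    return ' '.join(kept)

print(strip_generic_words('Carrfour supermarkt'))  # -> 'carrfour'

Calling strip_generic_words before normalize would let the matching focus on the brand tokens alone.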