 

Text normalization: text similarity in Python. How to normalize text spelling mismatches?

Tags:

python-3.x

nlp

I have a dataframe with a column A as below:

Column A
Carrefour supermarket
Carrefour hypermarket
Carrefour
carrefour
Carrfour downtown
Carrfor market
Lulu
Lulu Hyper
Lulu dxb
lulu airport
k.m trading
KM Trading
KM trade
K.M.  Trading
KM.Trading

I want to derive the below "Column A":

Column A
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Carrefour
Lulu
Lulu
Lulu
Lulu
KM Trading
KM Trading
KM Trading
KM Trading
KM Trading

To do this, I coded as below:

import re

MERCHANT_NAME_DICT = {"lulu": "Lulu", "carrefour": "Carrefour", "km": "KM Trading"}

def replace_merchant_name(row):
    """Provided a long merchant name replace it with short name."""
    processed_row = re.sub(r'\s+|\.', '', row.lower()).strip()
    for key, value in MERCHANT_NAME_DICT.items():
        if key in processed_row:
            return value

    return row

frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].astype(str)
frame['MERCHANT_NAME'] = frame['MERCHANT_NAME'].apply(replace_merchant_name)

But I want to use NLP logic and make it a generic function (instead of hard-coding the mapping values), so I could call the generic function on any similar data column and get the desired results. I am pretty new to NLP concepts, so I am looking for some help with it.

NOTE: Basically, I want a generic NLP approach that finds all similar words in a given column (or in a list).

asked May 08 '19 by pyds_learner


2 Answers

If you don't have a gold-standard set of "right" merchant names, this sounds like a clustering problem. It can be solved with a clever distance function (like the Jaro-Winkler distance from Jindřich's answer) and a simple clustering algorithm (e.g. agglomerative clustering).

After clustering the texts, you can find the most representative text in each cluster and replace the whole cluster with it.

import numpy as np
import re
import textdistance
# we need scikit-learn>=0.21 for distance_threshold;
# note: in scikit-learn>=1.2 the `affinity` parameter below is named `metric`
from sklearn.cluster import AgglomerativeClustering

texts = [
  'Carrefour supermarket', 'Carrefour hypermarket', 'Carrefour', 'carrefour', 'Carrfour downtown', 'Carrfor market', 
  'Lulu', 'Lulu Hyper', 'Lulu dxb', 'lulu airport', 
  'k.m trading', 'KM Trading', 'KM trade', 'K.M.  Trading', 'KM.Trading'
]

def normalize(text):
  """ Keep only lower-cased letters and digits, separated by single spaces """
  return re.sub('[^a-z0-9]+', ' ', text.lower())

def group_texts(texts, threshold=0.4): 
  """ Replace each text with the representative of its cluster"""
  normalized_texts = np.array([normalize(text) for text in texts])
  distances = 1 - np.array([
      [textdistance.jaro_winkler(one, another) for one in normalized_texts] 
      for another in normalized_texts
  ])
  clustering = AgglomerativeClustering(
    distance_threshold=threshold, # this parameter needs to be tuned carefully
    affinity="precomputed", linkage="complete", n_clusters=None
  ).fit(distances)
  centers = dict()
  for cluster_id in set(clustering.labels_):
    index = clustering.labels_ == cluster_id
    # the cluster's center is the text with the smallest total distance to the rest
    centrality = distances[:, index][index].sum(axis=1)
    centers[cluster_id] = normalized_texts[index][centrality.argmin()]
  return [centers[i] for i in clustering.labels_]

print(group_texts(texts))

The code above prints:

['carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 'carrefour', 
 'lulu', 'lulu', 'lulu', 'lulu', 
 'km trading', 'km trading', 'km trading', 'km trading', 'km trading']

As a baseline, this function will do. You may want to improve it by modifying the distance function so that it reflects your domain more closely, for example (see the sketch after this list):

  • take into account synonyms: supermarket=hypermarket=market
  • lemmatize words (so that trading=trade)
  • give smaller weight to the non-important words (IDF?)

Unfortunately, most such adjustments are domain-specific, so you will have to tune them to your own dataset.
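For instance, here is a minimal sketch of a domain-aware distance along those lines; the SYNONYMS map and the example strings are illustrative assumptions, not part of the original answer:

import re
import textdistance

# assumed synonym map for this domain; extend it for your own data
SYNONYMS = {"hypermarket": "market", "supermarket": "market", "trade": "trading"}

def domain_normalize(text):
    """ Lower-case, strip punctuation, and fold synonyms token by token """
    tokens = re.sub('[^a-z0-9]+', ' ', text.lower()).split()
    return ' '.join(SYNONYMS.get(token, token) for token in tokens)

def domain_distance(one, another):
    """ Jaro-Winkler distance computed on the domain-normalized strings """
    return 1 - textdistance.jaro_winkler(domain_normalize(one), domain_normalize(another))

# both sides normalize to 'carrefour market', so the distance collapses to 0.0
print(domain_distance('Carrefour supermarket', 'Carrefour hypermarket'))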

answered Oct 17 '22 by David Dale


You can do something like this: for each word that is not in a spell-checking vocabulary (and thus is likely to be misspelled), look in your list of merchant names and check whether there is a name with a small edit distance. You can also normalize the words for the similarity search, i.e., lowercase everything and remove punctuation.

You can use the textdistance package, which implements plenty of string distances. I would probably use the Jaro-Winkler distance for this purpose.

import string
import textdistance

MERCHANT_NAMES = [("lulu", "Lulu"), ("carrefour", "Carrefour"), ("km", "KM")]
DISTANCE_THRESHOLD = 0.9

def normalize(orig_name):
    """Return the canonical merchant name if one is close enough, else the original."""
    name_sig = orig_name.translate(str.maketrans('', '', string.punctuation)).lower()

    best_score = DISTANCE_THRESHOLD
    replacement = orig_name  # fall back to the original name if nothing matches

    for sig, name in MERCHANT_NAMES:
        # jaro_winkler returns a similarity in [0, 1]; higher means closer
        distance = textdistance.jaro_winkler(name_sig, sig)
        if distance > best_score:
            best_score = distance
            replacement = name
    return replacement

You will probably need to tune the threshold for an acceptable replacement and do something about multi-word expressions (e.g., throw away words similar to "supermarket", "hypermarket", etc.), as in the sketch below.
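As a minimal sketch of that filtering step (the GENERIC_WORDS list and the 0.85 cut-off are assumptions to tune, not part of the original answer):

import string
import textdistance

# assumed list of uninformative store words; extend it for your domain
GENERIC_WORDS = ["supermarket", "hypermarket", "market", "downtown", "airport"]

def strip_generic_words(name, cutoff=0.85):
    """ Drop tokens that closely match a generic store word before fuzzy matching """
    tokens = name.translate(str.maketrans('', '', string.punctuation)).lower().split()
    kept = [
        token for token in tokens
        if all(textdistance.jaro_winkler(token, generic) < cutoff for generic in GENERIC_WORDS)
    ]
    return ' '.join(kept)

print(strip_generic_words('Carrfour downtown'))  # -> 'carrfour'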

answered Oct 17 '22 by Jindřich