
Find the similarity between two string columns of a DataFrame

I am new to programming. I have a pandas DataFrame with two string columns.

The DataFrame looks like this:

Col-1             Col-2
Update            have a account
Account           account summary
AccountDTH        Cancel
Balance           Balance Summary
Credit Card       Update credit card

I need to check the similarity of each element of Col-2 against every element of Col-1. For example, I have to compare "have a account" with all the elements of Col-1 and then find the three most similar ones. Suppose the similarity scores are: Account (85), AccountDTH (80), Balance (60), Update (45), Credit Card (35).

Expected Output is:

Col-2              Output
have a account     Account(85),AccountDTH(80),Balance(60)
asked Aug 20 '18 15:08 by PANDA



1 Answer

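You can use a Python library such as fuzzywuzzy, which supports this kind of fuzzy string matching:

from fuzzywuzzy import process

# For each Col-2 string, take the 3 best-scoring Col-1 entries.
# assign() returns a new DataFrame; df itself is left unchanged.
df.assign(Output=[process.extract(i, df['Col-1'], limit=3) for i in df['Col-2']])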

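process.extract scores each Col-2 string against every value in Col-1 and returns the three best matches (fewer if Col-1 has fewer than three values). Because Col-1 is passed as a pandas Series, each result is a tuple of (match, score, index label).

The output of the above code: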

         Col-1               Col-2                                                         Output
0       Update      have a account       [(Account, 90, 1), (AccountDTH, 64, 2), (Update, 40, 0)]
1      Account     account summary  [(Account, 90, 1), (AccountDTH, 63, 2), (Credit Card, 38, 4)]
2   AccountDTH              Cancel      [(Balance, 62, 3), (Credit Card, 43, 4), (Update, 33, 0)]
3      Balance     Balance Summary      [(Balance, 90, 3), (Credit Card, 38, 4), (Update, 30, 0)]
4  Credit Card  Update credit card   [(Update, 90, 0), (Credit Card, 90, 4), (AccountDTH, 27, 2)]
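The question asks for output in the form Account(85),AccountDTH(80),Balance(60), while the Output column above holds lists of tuples. A minimal sketch of the conversion follows; the helper name format_matches is hypothetical, and the sample list is copied from row 0 of the output above.

```python
def format_matches(matches):
    """Turn fuzzywuzzy-style result tuples into 'Name(score)' text.

    Hypothetical helper, not part of fuzzywuzzy.
    """
    # Each tuple starts with (matched string, score); any trailing item,
    # such as the index label, is ignored.
    return ",".join(f"{m[0]}({m[1]})" for m in matches)


# Sample data copied from row 0 of the output shown above.
row0 = [("Account", 90, 1), ("AccountDTH", 64, 2), ("Update", 40, 0)]
formatted = format_matches(row0)
print(formatted)  # Account(90),AccountDTH(64),Update(40)
```

Assuming the result of df.assign(...) has been stored in a variable, the same helper could be applied to the whole column with .apply(format_matches) on its Output column.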

To speed up the comparison (without python-Levenshtein, fuzzywuzzy falls back to Python's difflib.SequenceMatcher), I would recommend installing python-Levenshtein.
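For illustration only, here is a rough standard-library sketch of sequence-matcher-based scoring. It is not fuzzywuzzy's scorer (process.extract applies additional preprocessing and weighting), so the scores will differ from the table above. The function name top_matches and the lowercasing step are assumptions made for this example.

```python
from difflib import SequenceMatcher


def top_matches(query, choices, limit=3):
    """Return up to `limit` (choice, score) pairs, best first.

    Hypothetical helper: scores are SequenceMatcher ratios scaled to 0-100.
    """
    scored = []
    for choice in choices:
        # Lowercasing is an assumption so that case does not affect the score.
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        scored.append((choice, round(ratio * 100)))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


col1 = ["Update", "Account", "AccountDTH", "Balance", "Credit Card"]
best = top_matches("have a account", col1)
print(best)
```

With the sample column from the question, "Account" and "AccountDTH" rank first and second for "have a account", which matches the ordering in the answer's output even though the numeric scores are computed differently.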

answered Nov 14 '22 21:11 by user3483203