Find the similarity between two string columns of a DataFrame

I am new to programming. I have a pandas DataFrame with two string columns.

The DataFrame looks like this:

Col-1             Col-2
Update            have a account
Account           account summary
AccountDTH        Cancel
Balance           Balance Summary
Credit Card       Update credit card

I need to check the similarity of each Col-2 element against every element of Col-1. That is, I have to compare "have a account" with all the elements of Col-1 and then find the top 3 most similar ones. Suppose the similarity scores are: Account(85), AccountDTH(80), Balance(60), Update(45), Credit Card(35).

The expected output is:

Col-2              Output
have a account     Account(85),AccountDTH(80),Balance(60)

asked Aug 20 '18 15:08

PANDA

1 Answer

You can use a Python library like fuzzywuzzy here, which has support for this type of task:

from fuzzywuzzy import process

df.assign(Output=[process.extract(i, df['Col-1'], limit=3) for i in df['Col-2']])

process.extract scores every candidate string in Col-1 against the query string and returns the top 3 matches (or fewer, if fewer than 3 candidates exist).

The output of the above code:

         Col-1               Col-2                                                         Output
0       Update      have a account       [(Account, 90, 1), (AccountDTH, 64, 2), (Update, 40, 0)]
1      Account     account summary  [(Account, 90, 1), (AccountDTH, 63, 2), (Credit Card, 38, 4)]
2   AccountDTH              Cancel      [(Balance, 62, 3), (Credit Card, 43, 4), (Update, 33, 0)]
3      Balance     Balance Summary      [(Balance, 90, 3), (Credit Card, 38, 4), (Update, 30, 0)]
4  Credit Card  Update credit card   [(Update, 90, 0), (Credit Card, 90, 4), (AccountDTH, 27, 2)]
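
If you need the exact Name(score) string from the question rather than a list of tuples, the result can be post-processed. A minimal sketch, assuming each match is a (choice, score, index) tuple as in the output above; format_matches is a hypothetical helper name, not part of fuzzywuzzy:

```python
def format_matches(matches):
    """Join (choice, score, ...) tuples into 'Choice(score),Choice(score)'."""
    # Only the first two fields are read, so 2-tuples and the 3-tuples
    # shown above (which also carry the row index) both work.
    return ",".join(f"{m[0]}({m[1]})" for m in matches)


# Sample data copied from row 0 of the output above.
row0 = [("Account", 90, 1), ("AccountDTH", 64, 2), ("Update", 40, 0)]
print(format_matches(row0))  # Account(90),AccountDTH(64),Update(40)
```

With the DataFrame from the question, this could be applied inside the list comprehension, for example format_matches(process.extract(i, df['Col-1'], limit=3)).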

To speed this comparison up (without it, fuzzywuzzy falls back to Python's difflib.SequenceMatcher), I would recommend installing python-Levenshtein.
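
For a rough idea of what the pure-Python fallback does, here is a standard-library-only sketch built directly on difflib.SequenceMatcher. It is an illustration, not fuzzywuzzy's implementation: it uses a plain case-insensitive ratio rather than fuzzywuzzy's default scorer, so the scores will differ from the table above. The names ratio and top_matches are hypothetical helpers.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def top_matches(query, choices, limit=3):
    """Return the `limit` best (choice, score) pairs, highest score first."""
    scored = [(c, ratio(query, c)) for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Col-1 values from the question.
col1 = ["Update", "Account", "AccountDTH", "Balance", "Credit Card"]
print(top_matches("account summary", col1))
```

Every query is compared against every choice, so the cost grows with len(Col-1) * len(Col-2); that is why a faster matcher helps on larger frames.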

answered Nov 14 '22 21:11

user3483203

Tags: python, string, pandas, nlp, similarity
