I am new to programming. I have a pandas DataFrame with two string columns.
The DataFrame looks like this:
Col-1             Col-2
Update            have a account
Account           account summary
AccountDTH        Cancel
Balance           Balance Summary
Credit Card       Update credit card
I need to check the similarity of each Col-2 element against every element of Col-1.
That means I have to compare "have a account" with all the elements of Col-1,
then find the 3 most similar ones. Suppose the similarity scores are: Account(85), AccountDTH(80), Balance(60), Update(45), Credit Card(35).
The expected output is:
Col-2              Output
have a account     Account(85),AccountDTH(80),Balance(60)
To find duplicate columns, iterate over all columns of the DataFrame and, for each one, check whether another column with the same contents already exists. If so, store that column's name in a set of duplicate columns.
The DataFrame.duplicated() method finds duplicate rows in a DataFrame. It returns a boolean Series indicating whether each row is a duplicate or unique.
You can use a Python library like fuzzywuzzy here, which has support for this type of task:
from fuzzywuzzy import process

# For each Col-2 string, keep the 3 best-scoring matches among the Col-1 values
df.assign(Output=[process.extract(i, df['Col-1'], limit=3) for i in df['Col-2']])
process.extract scores each Col-2 string against every Col-1 value, and limit=3 keeps the top 3 matches (or fewer, if fewer than 3 exist). Each result is a (match, score, index) tuple.
The output of the above code:
         Col-1               Col-2                                                         Output
0       Update      have a account       [(Account, 90, 1), (AccountDTH, 64, 2), (Update, 40, 0)]
1      Account     account summary  [(Account, 90, 1), (AccountDTH, 63, 2), (Credit Card, 38, 4)]
2   AccountDTH              Cancel      [(Balance, 62, 3), (Credit Card, 43, 4), (Update, 33, 0)]
3      Balance     Balance Summary      [(Balance, 90, 3), (Credit Card, 38, 4), (Update, 30, 0)]
4  Credit Card  Update credit card   [(Update, 90, 0), (Credit Card, 90, 4), (AccountDTH, 27, 2)]
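The question asks for output shaped like Account(85),AccountDTH(80),Balance(60) rather than a list of tuples. A minimal sketch of that conversion follows; format_matches is a hypothetical helper name, not part of fuzzywuzzy, and the sample tuples are copied from row 0 of the output above.

```python
# Hypothetical helper (not a fuzzywuzzy function): turn a list of
# (match, score, index) tuples into "Name(score),Name(score),..." text.
def format_matches(matches):
    # *_ ignores anything after the score, such as the index label
    return ",".join(f"{name}({score})" for name, score, *_ in matches)

# Sample copied from row 0 of the output above
row0 = [("Account", 90, 1), ("AccountDTH", 64, 2), ("Update", 40, 0)]
formatted = format_matches(row0)
print(formatted)  # Account(90),AccountDTH(64),Update(40)
```

Assuming the result of df.assign(...) is stored in a variable, the helper could be applied to its Output column with Series.apply(format_matches).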
To speed this comparison up (by default it falls back to Python's difflib SequenceMatcher), I would recommend installing python-Levenshtein.
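For a sense of what the sequence matcher does, here is a rough standard-library sketch of a top-3 lookup. It is not fuzzywuzzy's scoring: process.extract uses a weighted scorer by default, so the numbers will generally differ from the table above. The name top_matches and the lowercasing step are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def top_matches(query, choices, limit=3):
    """Return up to `limit` (choice, score) pairs, best first; score is 0-100."""
    scored = [
        # ratio() is in [0, 1]; lowercasing is an assumption to ignore case
        (choice, round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    # sorted() is stable, so tied scores keep their original order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

col1 = ["Update", "Account", "AccountDTH", "Balance", "Credit Card"]
best = top_matches("account summary", col1)
print(best)
```

This compares every query against every candidate in pure Python, which is the slow path that a compiled Levenshtein implementation is meant to replace.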