Match Substrings between Multiple Dataframes and Sum Weighted Value in Separate Column

I'm trying to categorise a list of strings called keywords based on whether they contain one or more substrings from another DataFrame.

Each substring carries a weight, so that at the end each keyword has a summed weight in each of roughly four columns. These columns will be called loan, mortgage, accounts, and cards.

I've tried join(), concat() and merge() to combine the DataFrames, as well as where(), isin() and str.contains().

I've come close to the desired result a few times, but was thwarted either by errors or by scalability. I have limited experience with Python and Pandas, so there may be a completely different approach that I've not considered.

import pandas as pd

df = pd.read_csv('Inputs/keyword.csv', header=0)
df['loan'] = 0
df2 = pd.read_csv('Essentials/Groups/loans.csv', header=0)

#review the data in df
print(df.head())

output:
           keywords  search_volume  loan
0             loans         132000     0
1    funding circle          81000     0
2        government          36000     0
3  short term loans          30000     0
4           company          27000     0

In the above code I'm loading up my list of keywords including an associated search volume. It currently has no loan column so I've added one with the default value set to 0.

Below, I have another DataFrame that holds a list of terms and an associated weighting. The weights are arbitrary integers (5 for loan, for example) that I would like added to the keyword's existing total whenever the term matches.

#review the data in df2
print(df2.head())

output:
      terms  weight
0      loan       5
1     loans       5
2  personal       3
3  business       3
4       apr       4

There is a problem that I have spotted but don't know how to resolve.

My list includes both loan and loans. It is fine for me to have these duplicates on my end. However, row 3 has a value of 'short term loans' and is marked as False.

Since the term loan and loans both appear in short term loans, I would have expected it to be marked as True. I've tried reversing the .isin() statement so that I'm searching df2['terms'] in df['keywords'], but the result came out the same.

#review the true/false for overlaps
print(df['keywords'].isin(df2['terms']).head())

output:
0     True
1    False
2    False
3    False
4    False
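
The result above follows from what isin() tests: whether each whole cell value equals one of the listed values, not whether it contains one. A minimal sketch of the difference, assuming small inline frames as stand-ins for keyword.csv and loans.csv (the inline data and the use of re.escape are assumptions, not part of the original code):

```python
import re
import pandas as pd

# Hypothetical inline stand-ins for the two CSV files
df = pd.DataFrame({"keywords": ["loans", "funding circle", "government",
                                "short term loans", "company"]})
df2 = pd.DataFrame({"terms": ["loan", "loans", "personal", "business", "apr"]})

# isin() compares whole cell values, so only the exact string "loans" matches
exact = df["keywords"].isin(df2["terms"])

# str.contains() with an alternation pattern tests for substrings instead;
# re.escape guards against terms that contain regex metacharacters
pattern = "|".join(re.escape(t) for t in df2["terms"])
partial = df["keywords"].str.contains(pattern)

print(exact.tolist())    # [True, False, False, False, False]
print(partial.tolist())  # [True, False, False, True, False]
```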

Lastly, once this boolean issue is fixed, I don't know how to vectorise adding the matched weights to df['loan']. I'm trying to avoid for loops because I expect the keyword list to contain 100,000+ rows, and each category DataFrame may contain around 1,000 terms.

The desired output would look something like this:

output:
           keywords  search_volume  loan  mortgage  accounts  cards
0             loans         132000    10         0         0      0
1    funding circle          81000     0         0         0      0
2        government          36000     0         0         0      0
3  short term loans          30000    10         0         0      0
4           company          27000     0         0         0      0

Rowan Collins asked Mar 03 '23 22:03

1 Answer

Taking df1 as:

           keywords  search_volume
0             loans         132000
1    funding circle          81000
2        government          36000
3  short term loans          30000
4           company          27000

You can use Series.str.extract() together with DataFrame.assign(), something like:

# Map each term to its weight
d = df2.set_index('terms')['weight']
# Alternation pattern: '(loan|loans|personal|business|apr)'
pat = r'({})'.format('|'.join(df2.terms))
# First matching term per keyword (NaN when nothing matches)
m = df1.keywords.str.extract(pat, expand=False)
df1 = df1.assign(term_match=m, weight=m.map(d))
print(df1)

Output

           keywords  search_volume term_match  weight
0             loans         132000       loan     5.0
1    funding circle          81000        NaN     NaN
2        government          36000        NaN     NaN
3  short term loans          30000       loan     5.0
4           company          27000        NaN     NaN
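
One caveat with the pattern above: regex alternation tries the alternatives left to right, so with loan listed before loans the longer term can never be the one reported, which is why row 0 shows loan. A sketch of one way around this, assuming it is acceptable to sort the terms longest-first and escape them with re.escape (neither step is in the code above, and the inline frames are stand-ins for the real data):

```python
import re
import pandas as pd

# Hypothetical inline stand-ins for the real frames
df1 = pd.DataFrame({"keywords": ["loans", "funding circle", "short term loans"]})
df2 = pd.DataFrame({"terms": ["loan", "loans", "personal", "business", "apr"],
                    "weight": [5, 5, 3, 3, 4]})

d = df2.set_index("terms")["weight"]

# Longest terms first, so "loans" is tried before its prefix "loan"
ordered = sorted(df2["terms"], key=len, reverse=True)
pat = "({})".format("|".join(re.escape(t) for t in ordered))

term_match = df1["keywords"].str.extract(pat, expand=False)
weight = term_match.map(d)

print(term_match.tolist())  # ['loans', nan, 'loans']
```

This still reports a single term per keyword; it only changes which of the overlapping terms wins.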

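EDIT: To find all matching strings, first add a test row to df2 with df2.loc[5] = ['term', 3]. Because d and pat were built from the old df2, they need rebuilding so that they include the new term.

Then use Series.str.findall():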

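# Rebuild the lookup and pattern so they include the new 'term' row
d = df2.set_index('terms')['weight']
pat = r'({})'.format('|'.join(df2.terms))
# All non-overlapping matches per keyword, as a list
s = df1.keywords.str.findall(pat)
df1 = df1.assign(term_match=s.apply(','.join),
                 weight=s.apply(lambda x: sum(d.get(item, 0) for item in x)))
print(df1)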

           keywords  search_volume  loan term_match  weight
0             loans         132000     0       loan       5
1    funding circle          81000     0                  0
2        government          36000     0                  0
3  short term loans          30000     0  term,loan       8
4           company          27000     0                  0
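
Note that regex matches are non-overlapping, so a keyword such as loans is counted once (as loan, weight 5) rather than scoring 10 as in the question's desired output. A rough sketch of one way to get per-category totals where every term is counted independently as a plain substring. The category_score helper, the categories dict, and the mortgage terms are all hypothetical, and the inline frame stands in for keyword.csv:

```python
import pandas as pd

df = pd.DataFrame({
    "keywords": ["loans", "funding circle", "government",
                 "short term loans", "company"],
    "search_volume": [132000, 81000, 36000, 30000, 27000],
})

# Hypothetical category tables; only the "loan" terms come from the question
categories = {
    "loan": pd.DataFrame({"terms": ["loan", "loans", "personal", "business", "apr"],
                          "weight": [5, 5, 3, 3, 4]}),
    "mortgage": pd.DataFrame({"terms": ["mortgage", "remortgage"],
                              "weight": [5, 5]}),
}

def category_score(keywords, terms_df):
    """Sum the weight of every term that occurs as a substring of each keyword."""
    total = pd.Series(0, index=keywords.index)
    for term, weight in zip(terms_df["terms"], terms_df["weight"]):
        # One vectorised pass over all keywords per term
        hits = keywords.str.contains(term, regex=False, na=False)
        total = total + hits.astype(int) * weight
    return total

for name, terms_df in categories.items():
    df[name] = category_score(df["keywords"], terms_df)

print(df)
```

The loop here runs over terms, not keywords, so each iteration is still a vectorised operation over the whole keyword column. Whether that is fast enough for 100,000+ keywords and around 1,000 terms per category would need measuring on the real data.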

anky answered Apr 08 '23 13:04