I'm trying to categorise a list of strings called keywords based on whether they contain one or more substrings from another DataFrame. The substrings are weighted so that, at the end, each keyword has a weight in approximately four columns. These columns will be called loan, mortgage, accounts, and cards.
I've attempted to use join(), concat() and merge() to join multiple DataFrames, as well as where(), isin() and contains(). I've come close a few times to achieving my desired goal, but have been thwarted either by errors or by scalability issues. I have limited experience with Python and Pandas, so there may be a completely different approach that I've not considered.
import pandas as pd

# keyword list with associated search volumes
df = pd.read_csv('Inputs/keyword.csv', header=0)
df['loan'] = 0  # placeholder column for the loan category weight

# terms and weights for the loan category
df2 = pd.read_csv('Essentials/Groups/loans.csv', header=0)
#review the data in df
print(df.head())
output:
keywords search_volume loan
0 loans 132000 0
1 funding circle 81000 0
2 government 36000 0
3 short term loans 30000 0
4 company 27000 0
In the above code I'm loading my list of keywords, along with an associated search volume. It currently has no loan column, so I've added one with the default value set to 0.
Below, I have another DataFrame that includes a list of terms and an associated weighting. I've arbitrarily picked an integer weight of 5 that I would like added to the existing total.
#review the data in df2
print(df2.head())
output:
terms weight
0 loan 5
1 loans 5
2 personal 3
3 business 3
4 apr 4
There is a problem that I have spotted but don't know how to resolve.
My list includes both loan and loans; it is fine for me to have these duplicates on my end. However, row 3 has a value of 'short term loans' and is marked as False. Since the terms loan and loans both appear in 'short term loans', I would have expected it to be marked as True. I've tried reversing the .isin() statement so that I'm searching for df2['terms'] in df['keywords'], but the result came out the same.
#review the true/false for overlaps
print(df['keywords'].isin(df2['terms']).head())
output:
0 True
1 False
2 False
3 False
4 False
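As an aside on why the check above comes back False: .isin() tests whole-value equality against df2['terms'], not substring containment, so 'short term loans' never equals any single term. A substring test looks more like Series.str.contains() with a pattern built from the terms. A minimal, self-contained sketch (the small inline frames below just mirror the df and df2 shown above):
import pandas as pd

# small copies of the frames shown above, just for illustration
df = pd.DataFrame({'keywords': ['loans', 'funding circle', 'government',
                                'short term loans', 'company'],
                   'search_volume': [132000, 81000, 36000, 30000, 27000]})
df2 = pd.DataFrame({'terms': ['loan', 'loans', 'personal', 'business', 'apr'],
                    'weight': [5, 5, 3, 3, 4]})

pat = '|'.join(df2['terms'])                 # 'loan|loans|personal|business|apr'
print(df['keywords'].isin(df2['terms']))     # exact match: only row 0 is True
print(df['keywords'].str.contains(pat))      # substring match: row 3 is True as well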
Lastly, once we fix this boolean issue, I don't know how to vectorise a sum change in df['loan'] based on matches. I'm trying to avoid for loops because I'm expecting the keyword list to include around 100,000+ rows, and each category DataFrame may include 1,000 terms.
The desired output would look something like this:
output:
keywords search_volume loan mortgage accounts cards
0 loans 132000 10 0 0 0
1 funding circle 81000 0 0 0 0
2 government 36000 0 0 0 0
3 short term loans 30000 10 0 0 0
4 company 27000 0 0 0 0
Considering df1 as:
keywords search_volume
0 loans 132000
1 funding circle 81000
2 government 36000
3 short term loans 30000
4 company 27000
you can use Series.str.extract() and DataFrame.assign(); we can do something like:
# map each term to its weight, and build an alternation pattern from the terms
d = df2.set_index('terms')['weight']
pat = r'({})'.format('|'.join(df2.terms))  # '(loan|loans|personal|business|apr)'

# extract the first matching term from each keyword and map it to its weight
df1 = df1.assign(**{'term_match': df1.keywords.str.extract(pat, expand=False),
                    'weight': df1.keywords.str.extract(pat, expand=False).map(d)})
print(df1)
Output
keywords search_volume term_match weight
0 loans 132000 loan 5.0
1 funding circle 81000 NaN NaN
2 government 36000 NaN NaN
3 short term loans 30000 loan 5.0
4 company 27000 NaN NaN
EDIT
str.extract() only returns the first match per keyword. To find all matching strings, let's update df2 with df2.loc[5] = ['term', 3], just for testing. Then, using Series.str.findall():
# find every (non-overlapping) term match per keyword, then sum the matched weights
s = df1.keywords.str.findall(pat)
df1 = df1.assign(**{'term_match': s.apply(','.join),
                    'weight': s.apply(lambda x: sum(d.get(item, 0) for item in x))})
print(df1)
keywords search_volume loan term_match weight
0 loans 132000 0 loan 5
1 funding circle 81000 0 0
2 government 36000 0 0
3 short term loans 30000 0 term,loan 8
4 company 27000 0 0
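As a sketch of how this might be extended to the four-column layout asked for in the question (loan, mortgage, accounts, cards): repeat the findall-and-sum step once per category file and write each total into its own column. The mortgage/accounts/cards file paths below are assumptions modelled on loans.csv, so adjust them to your layout; the loop runs only over the four categories, while each column is still computed with vectorised string methods over all keywords.
import pandas as pd

df = pd.read_csv('Inputs/keyword.csv', header=0)

# assumed file names, following the loans.csv pattern -- adjust to your setup
category_files = {'loan': 'Essentials/Groups/loans.csv',
                  'mortgage': 'Essentials/Groups/mortgages.csv',
                  'accounts': 'Essentials/Groups/accounts.csv',
                  'cards': 'Essentials/Groups/cards.csv'}

for col, path in category_files.items():
    terms = pd.read_csv(path, header=0)                # columns: terms, weight
    d = terms.set_index('terms')['weight']
    pat = r'({})'.format('|'.join(terms['terms']))
    matches = df['keywords'].str.findall(pat)          # all non-overlapping term hits
    df[col] = matches.apply(lambda x: sum(d.get(item, 0) for item in x))

print(df.head())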