I want to fill df1 dataframe's "Category" column with the correct values from df2 dataframe's "Category" column.
import pandas as pd
df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]})
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})
Output:
df1
Receiver Category
0 Insurance company
1 Shop
2 Pizza place
3 Library
4 Gas station 24/7
5 Something else
6 Whatever receiver
df2
Category Searchterm
0 Insurances Insur
1 Groceries Shop
2 Groceries Market
3 Fastfood Pizza
4 Fastfood Burger
5 Car Gas
I want to compare df1["Receiver"] to df2["Searchterm"] row by row, and where the latter even partially matches the former, assign that row's df2["Category"] to df1["Category"].
For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (which is Pizza's category in df2["Category"]) to the "Pizza place"'s category in df1["Category"].
The desired output would be:
df1
Receiver Category
0 Insurance company Insurances
1 Shop Groceries
2 Pizza place Fastfood
3 Library
4 Gas station 24/7 Car
5 Something else
6 Whatever receiver
So how can I fill df1["Category"]with the right categories? Thank you.
Under the assumption the number of categories is small relative to the number of receivers, one strategy is to iterate categories. With this solution, note the last match only will stick where multiple categories are found.
for tup in df2.itertuples(index=False):
mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
df1.loc[mask, 'Category'] = tup.Category
print(df1)
# Category Receiver
# 0 Insurances Insurance company
# 1 Groceries Shop
# 2 Fastfood Pizza place
# 3 Library
# 4 Car Gas station 24/7
# 5 Something else
# 6 Whatever receiver
As noted, this solution scales better with rows in df1 than with categories in df2. To illustrate, consider performance below for differently sized input dataframes.
def jpp(df1, df2):
for tup in df2.itertuples(index=False):
df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
return df1
def user347(df1, df2):
df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
df2['Category'].values,
regex=True)
df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
return df1
df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)
%timeit jpp(df1, df2) # 145 ms per loop
%timeit user347(df1, df2) # 364 ms per loop
df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)
%timeit jpp(df1, df2) # 666 ms per loop
%timeit user347(df1, df2) # 88 ms per loop
One more solution using str.extract
pat = '('+'|'.join(df2['Searchterm'])+')'
df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')
Receiver Category
0 Insurance company Insurances
1 Shop Groceries
2 Pizza place Fastfood
3 Library
4 Gas station 24/7 Car
5 Something else
6 Whatever receiver
def jpp(df1, df2):
for tup in df2.itertuples(index=False):
df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
return df1
def user347(df1, df2):
df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
df2['Category'].values,
regex=True)
df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
return df1
def vai(df1, df2):
pat = '('+'|'.join(df2['Searchterm'])+')'
df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')
df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)
%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)
120 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
221 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
78.2 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)
%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)
11.4 s ± 276 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20.4 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
98.3 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With