Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding partial matches between two different dataframes' columns, and assigning values when matches are found

I want to fill df1 dataframe's "Category" column with the correct values from df2 dataframe's "Category" column.

import pandas as pd

df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]}) 
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})

Output:

df1
Receiver                Category
0   Insurance company   
1   Shop    
2   Pizza place 
3   Library 
4   Gas station 24/7    
5   Something else  
6   Whatever receiver   

df2
    Category    Searchterm
0   Insurances  Insur
1   Groceries   Shop
2   Groceries   Market
3   Fastfood    Pizza
4   Fastfood    Burger
5   Car         Gas

I want to compare df1["Receiver"] to df2["Searchterm"] row by row, and where the latter even partially matches the former, assign that row's df2["Category"] to df1["Category"].

For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (which is Pizza's category in df2["Category"]) to the "Pizza place"'s category in df1["Category"].

The desired output would be:

df1
Receiver                Category
0   Insurance company   Insurances
1   Shop                Groceries
2   Pizza place         Fastfood
3   Library             
4   Gas station 24/7    Car
5   Something else      
6   Whatever receiver   

So how can I fill df1["Category"]with the right categories? Thank you.

like image 963
KMFR Avatar asked Feb 26 '26 01:02

KMFR


2 Answers

Iterate categories

Under the assumption the number of categories is small relative to the number of receivers, one strategy is to iterate categories. With this solution, note the last match only will stick where multiple categories are found.

for tup in df2.itertuples(index=False):
    mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
    df1.loc[mask, 'Category'] = tup.Category

print(df1)

#      Category           Receiver
# 0  Insurances  Insurance company
# 1   Groceries               Shop
# 2    Fastfood        Pizza place
# 3                        Library
# 4         Car   Gas station 24/7
# 5                 Something else
# 6              Whatever receiver

Performance benchmarking

As noted, this solution scales better with rows in df1 than with categories in df2. To illustrate, consider performance below for differently sized input dataframes.

def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)      # 145 ms per loop
%timeit user347(df1, df2)  # 364 ms per loop

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)      # 666 ms per loop
%timeit user347(df1, df2)  # 88 ms per loop
like image 82
jpp Avatar answered Feb 27 '26 15:02

jpp


One more solution using str.extract

pat = '('+'|'.join(df2['Searchterm'])+')'
df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')

    Receiver            Category
0   Insurance company   Insurances
1   Shop                Groceries
2   Pizza place         Fastfood
3   Library 
4   Gas station 24/7    Car
5   Something else  
6   Whatever receiver   

Performance Benchmarking

def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

def vai(df1, df2):
    pat = '('+'|'.join(df2['Searchterm'])+')'
    df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)    
%timeit user347(df1, df2)
%timeit vai(df1, df2)


120 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
221 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
78.2 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)

11.4 s ± 276 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20.4 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
98.3 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 45
Vaishali Avatar answered Feb 27 '26 15:02

Vaishali



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!