Finding partial matches between two different dataframes' columns, and assigning values when matches are found

Question

I want to fill df1 dataframe's "Category" column with the correct values from df2 dataframe's "Category" column.

import pandas as pd

df1 = pd.DataFrame({"Receiver": ["Insurance company", "Shop", "Pizza place", "Library", "Gas station 24/7", "Something else", "Whatever receiver"], "Category": ["","","","","","",""]}) 
df2 = pd.DataFrame({"Category": ["Insurances", "Groceries", "Groceries", "Fastfood", "Fastfood", "Car"], "Searchterm": ["Insurance", "Shop", "Market", "Pizza", "Burger", "Gas"]})

Output:

df1
Receiver                Category
0   Insurance company   
1   Shop    
2   Pizza place 
3   Library 
4   Gas station 24/7    
5   Something else  
6   Whatever receiver   

df2
    Category    Searchterm
0   Insurances  Insur
1   Groceries   Shop
2   Groceries   Market
3   Fastfood    Pizza
4   Fastfood    Burger
5   Car         Gas

I want to compare df1["Receiver"] to df2["Searchterm"] row by row, and where the latter even partially matches the former, assign that row's df2["Category"] to df1["Category"].

For example, "Pizza" in df2["Searchterm"] partially matches "Pizza place" in df1["Receiver"], so I want to assign "Fastfood" (which is Pizza's category in df2["Category"]) to the "Pizza place"'s category in df1["Category"].

The desired output would be:

df1
Receiver                Category
0   Insurance company   Insurances
1   Shop                Groceries
2   Pizza place         Fastfood
3   Library             
4   Gas station 24/7    Car
5   Something else      
6   Whatever receiver

So how can I fill df1["Category"]with the right categories? Thank you.

jpp · Accepted Answer

Iterate categories

Under the assumption the number of categories is small relative to the number of receivers, one strategy is to iterate categories. With this solution, note the last match only will stick where multiple categories are found.

for tup in df2.itertuples(index=False):
    mask = df1['Receiver'].str.contains(tup.Searchterm, regex=False)
    df1.loc[mask, 'Category'] = tup.Category

print(df1)

#      Category           Receiver
# 0  Insurances  Insurance company
# 1   Groceries               Shop
# 2    Fastfood        Pizza place
# 3                        Library
# 4         Car   Gas station 24/7
# 5                 Something else
# 6              Whatever receiver

Performance benchmarking

As noted, this solution scales better with rows in df1 than with categories in df2. To illustrate, consider performance below for differently sized input dataframes.

def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)      # 145 ms per loop
%timeit user347(df1, df2)  # 364 ms per loop

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)      # 666 ms per loop
%timeit user347(df1, df2)  # 88 ms per loop

Vaishali · Answer

One more solution using str.extract

pat = '('+'|'.join(df2['Searchterm'])+')'
df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')

    Receiver            Category
0   Insurance company   Insurances
1   Shop                Groceries
2   Pizza place         Fastfood
3   Library 
4   Gas station 24/7    Car
5   Something else  
6   Whatever receiver

Performance Benchmarking

def jpp(df1, df2):
    for tup in df2.itertuples(index=False):
        df1.loc[df1['Receiver'].str.contains(tup.Searchterm, regex=False), 'Category'] = tup.Category
    return df1

def user347(df1, df2):
    df1['Category'] = df1['Receiver'].replace((df2['Searchterm'] + r'.*').values,
                                              df2['Category'].values,
                                              regex=True)
    df1.loc[df1['Receiver'].isin(df1['Category']), 'Category'] = ''
    return df1

def vai(df1, df2):
    pat = '('+'|'.join(df2['Searchterm'])+')'
    df1["Category"] = df1['Receiver'].str.extract(pat)[0].map(df2.set_index('Searchterm')['Category'].to_dict()).fillna('')

df1 = pd.concat([df1]*10**4, ignore_index=True)
df2 = pd.concat([df2], ignore_index=True)

%timeit jpp(df1, df2)    
%timeit user347(df1, df2)
%timeit vai(df1, df2)


120 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
221 ms ± 4.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
78.2 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df1 = pd.concat([df1], ignore_index=True)
df2 = pd.concat([df2]*100, ignore_index=True)

%timeit jpp(df1, df2)
%timeit user347(df1, df2)
%timeit vai(df1, df2)

11.4 s ± 276 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
20.4 s ± 296 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
98.3 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Finding partial matches between two different dataframes' columns, and assigning values when matches are found

Tags:

python

string-matching

pandas

dataframe

KMFR

2 Answers

Iterate categories

Performance benchmarking

jpp

Performance Benchmarking

Vaishali

Recent Activity

Donate For Us

Finding partial matches between two different dataframes' columns, and assigning values when matches are found

Tags:

python

string-matching

pandas

dataframe

KMFR

2 Answers

Iterate categories

Performance benchmarking

jpp

Performance Benchmarking

Vaishali

Related questions

Recent Activity

Donate For Us