
Merge two pandas DataFrame based on partial match




Two DataFrames have city names that are not formatted the same way. I'd like to do a left outer join and pull the Geo field for all partial string matches between the City columns of the two DataFrames.

import pandas as pd

df1 = pd.DataFrame({
                    'City': ['San Francisco, CA', 'Oakland, CA'],
                    'Val': [1, 2]
})

df2 = pd.DataFrame({
                    'City': ['San Francisco-Oakland, CA', 'Salinas, CA'],
                    'Geo': ['geo1', 'geo2']
})

Expected DataFrame upon join:

 City                   Val   Geo

 San Francisco, CA      1     geo1
 Oakland, CA            2     geo1
kms asked Sep 09 '21 23:09



2 Answers

Update: the fuzzywuzzy project has been renamed to thefuzz and moved to a new repository.

You can use the thefuzz package and its extractOne function:

# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy

from thefuzz import process

best_city = lambda x: process.extractOne(x, df2["City"])[2]  # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values


>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1

Note: because the choices are passed as a pandas Series, extractOne returns a tuple of 3 values for the best match: the City name from df2 [0], the accuracy score [1] and the index [2] (<- the one I use).
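One thing the snippet above does not cover is the "left outer" part of the question: with its default settings extractOne returns some best match for every city, so a row with no real counterpart would still get a Geo. Below is a minimal sketch of one way to leave such rows empty. It does not use thefuzz's scorer; it assumes a simple token-overlap score instead, and the names token_overlap, best_index, best_geo, the 0.75 cutoff and the extra "Fresno, CA" row are all hypothetical, added only for illustration.

```python
import re

import pandas as pd


def token_overlap(left: str, right: str) -> float:
    """Fraction of the word tokens in `left` that also appear in `right`."""
    left_tokens = set(re.findall(r"\w+", left.lower()))
    right_tokens = set(re.findall(r"\w+", right.lower()))
    return len(left_tokens & right_tokens) / len(left_tokens) if left_tokens else 0.0


def best_index(name: str, choices: pd.Series, min_score: float = 0.75):
    """Index label of the best-scoring choice, or None if nothing reaches min_score."""
    scores = choices.map(lambda choice: token_overlap(name, choice))
    return scores.idxmax() if scores.max() >= min_score else None


# "Fresno, CA" is a made-up row with no counterpart in df2.
df1 = pd.DataFrame(
    {"City": ["San Francisco, CA", "Oakland, CA", "Fresno, CA"], "Val": [1, 2, 3]}
)
df2 = pd.DataFrame(
    {"City": ["San Francisco-Oakland, CA", "Salinas, CA"], "Geo": ["geo1", "geo2"]}
)


def best_geo(name: str):
    """Geo of the best match in df2, or None when no choice is good enough."""
    idx = best_index(name, df2["City"])
    return df2.at[idx, "Geo"] if idx is not None else None


df1["Geo"] = df1["City"].map(best_geo)
print(df1)
```

The first two rows get geo1 as in the output above, and the Fresno row is left without a Geo because only the "CA" token overlaps. With thefuzz itself, extractOne takes a score_cutoff argument and returns None when the best score falls below it, so the same pattern should carry over, though the cutoff would then be on thefuzz's 0 to 100 scale.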

Corralien answered Sep 17 '22 14:09


This should do the job. It matches strings using Levenshtein distance.

pip install thefuzz[speedup]

import pandas as pd
import numpy as np

from thefuzz import process

def fuzzy_match(
    a: pd.DataFrame, b: pd.DataFrame, col: str, limit: int = 5, thresh: int = 80
):
    """use fuzzy matching to join on column

    note: `thresh` is accepted but not applied here, the best match is always kept
    """

    s = b[col].tolist()

    # for each value in a, get up to `limit` (match, score) candidates from b
    matches = a[col].apply(lambda x: process.extract(x, s, limit=limit))
    matches = pd.DataFrame(np.concatenate(matches.tolist()), columns=["match", "score"])

    # join other columns in b to matches
    to_join = (
        pd.merge(left=b, right=matches, how="right", left_on=col, right_on="match")
        .set_index(  # create an index that represents the matching row in df a, you can drop this when `limit=1`
            np.concatenate(
                [
                    np.repeat(i, limit if limit < len(b) else len(b))
                    for i in range(len(a))
                ]
            )
        )
        .drop(columns=["match"])
        .astype({"score": "int16"})
    )
    print("\t the index here represents the row in dataframe a on which to join")
    print(to_join)

    res = pd.merge(
        left=a, right=to_join, left_index=True, right_index=True, suffixes=("", "_b")
    )

    # return only the highest match or you can just set the limit to 1
    # and remove this
    df = res.reset_index()
    df = df.iloc[df.groupby(by="index")["score"].idxmax()].reset_index(drop=True)

    return df.drop(columns=[f"{col}_b", "score", "index"])

def test(df):

    expected = pd.DataFrame(
        {
            "City": ["San Francisco, CA", "Oakland, CA"],
            "Val": [1, 2],
            "Geo": ["geo1", "geo1"],
        }
    )

    assert expected.equals(df)

if __name__ == "__main__":

    a = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
    b = pd.DataFrame(
        {"City": ["San Francisco-Oakland, CA", "Salinas, CA"], "Geo": ["geo1", "geo2"]}
    )

    print(f'\n\n{"fuzzy match":-^70}')
    res = fuzzy_match(a, b, col="City")
    test(res)
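The set_index step is probably the least obvious part of fuzzy_match, so here is a small sketch of what it computes for the example data. The variable names (n_rows_a, n_rows_b, per_row, row_labels) are mine and only for illustration. process.extract returns at most limit candidates per query and never more than there are choices, so every row of a contributes the same number of candidate rows, and repeating each row position that many times labels each candidate with the row of a it belongs to.

```python
import numpy as np

n_rows_a = 2  # rows in DataFrame a
n_rows_b = 2  # rows in DataFrame b (the choices)
limit = 5     # default `limit` of fuzzy_match

# Candidates returned per row of a: capped by both `limit` and the number of choices.
per_row = min(limit, n_rows_b)

# The first `per_row` candidate rows belong to row 0 of a, the next ones to row 1, and so on.
row_labels = np.repeat(np.arange(n_rows_a), per_row)
print(row_labels)  # [0 0 1 1]
```

This gives the same labels as the `np.repeat(i, ...) for i in range(len(a))` expression in the function, and it assumes, as the function does, that a has the default 0..n-1 index.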

Ian Zurutuza answered Sep 18 '22 14:09