Merge two pandas DataFrames based on partial match

Tags:

python

pandas

Two DataFrames contain city names that are not formatted the same way. I'd like to do a left outer join and pull the Geo field for all partial string matches on the City column of both DataFrames.

import pandas as pd

df1 = pd.DataFrame({
                    'City': ['San Francisco, CA','Oakland, CA'], 
                    'Val': [1,2]
                  })

df2 = pd.DataFrame({
                    'City': ['San Francisco-Oakland, CA','Salinas, CA'], 
                    'Geo': ['geo1','geo2']
                  })

Expected DataFrame upon join:

 City                   Val   Geo
 San Francisco, CA      1     geo1
 Oakland, CA            2     geo1

asked Sep 09 '21 23:09

kms

2 Answers

Update: the fuzzywuzzy project has been renamed to thefuzz.

You can use thefuzz package and the function extractOne:

# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy

from thefuzz import process

best_city = lambda x: process.extractOne(x, df2["City"])[2]  # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values

Output:

>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1

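Note: extractOne returns a tuple of 3 values for the best match: the city name from df2 [0], the similarity score [1], and the index [2] (the one used here). The index is included because the choices are passed as a pandas Series; with a plain list, extractOne returns only the match and the score.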

answered Sep 17 '22 14:09

Corralien

This should do the job. It matches strings using Levenshtein distance.

pip install thefuzz[speedup]

import pandas as pd
import numpy as np

from thefuzz import process

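def fuzzy_match(
    a: pd.DataFrame, b: pd.DataFrame, col: str, limit: int = 5, thresh: int = 80
):
    """Fuzzy-join b onto a using column `col`.

    Keeps the best of up to `limit` candidates per row of a.
    Note: `thresh` is accepted but not used to filter matches.
    """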

    s = b[col].tolist()

    matches = a[col].apply(lambda x: process.extract(x, s, limit=limit))
    matches = pd.DataFrame(np.concatenate(matches), columns=["match", "score"])

    # join other columns in b to matches
    to_join = (
        pd.merge(left=b, right=matches, how="right", left_on=col, right_on="match")
        .set_index(  # create an index that represents the matching row in df a, you can drop this when `limit=1`
            np.array(
                list(
                    np.repeat(i, limit if limit < len(b) else len(b))
                    for i in range(len(a))
                )
            ).flatten()
        )
        .drop(columns=["match"])
        .astype({"score": "int16"})
    )
    print("\t the index here represents the row in dataframe a on which to join")
    print(to_join)

    res = pd.merge(
        left=a, right=to_join, left_index=True, right_index=True, suffixes=("", "_b")
    )

    # return only the highest match or you can just set the limit to 1
    # and remove this
    df = res.reset_index()
    df = df.iloc[df.groupby(by="index")["score"].idxmax()].reset_index(drop=True)

    return df.drop(columns=[f"{col}_b", "score", "index"])

def test(df):

    expected = pd.DataFrame(
        {
            "City": ["San Francisco, CA", "Oakland, CA"],
            "Val": [1, 2],
            "Geo": ["geo1", "geo1"],
        }
    )

    print(f'{"expected":-^70}')
    print(expected)

    print(f'{"res":-^70}')
    print(df)

    assert expected.equals(df)


if __name__ == "__main__":

    a = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
    b = pd.DataFrame(
        {"City": ["San Francisco-Oakland, CA", "Salinas, CA"], "Geo": ["geo1", "geo2"]}
    )

    print(f'\n\n{"fuzzy match":-^70}')
    res = fuzzy_match(a, b, col="City")
    test(res)
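
For intuition about why a partial-match score pairs 'Oakland, CA' with 'San Francisco-Oakland, CA' rather than with the equally long 'Salinas, CA', here is a dependency-free sketch. It slides the shorter string across the longer one and keeps the best difflib.SequenceMatcher ratio. This is a simplified stand-in for illustration, not thefuzz's actual scorer, and partial_score is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def partial_score(a: str, b: str) -> float:
    """Best similarity (0..1) between the shorter string and any
    equally long window of the longer string (case-insensitive)."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0.0
    windows = (
        long_[i:i + len(short)]
        for i in range(len(long_) - len(short) + 1)
    )
    return max(SequenceMatcher(None, short, w).ratio() for w in windows)

choices = ["San Francisco-Oakland, CA", "Salinas, CA"]

# Pick the highest-scoring choice for each city from the question
best = {
    city: max(choices, key=lambda c: partial_score(city, c))
    for city in ["San Francisco, CA", "Oakland, CA"]
}
for city, match in best.items():
    print(f"{city} -> {match}")
```

Because 'oakland, ca' appears verbatim inside the longer name, its window score is a perfect 1.0, whereas a whole-string comparison would be dragged down by the extra 'San Francisco-' prefix.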

answered Sep 18 '22 14:09

Ian Zurutuza
