Two DataFrames have city names that are not formatted the same way. I'd like to do a left outer join and pull the Geo field for all partial string matches on the City field in both DataFrames.
import pandas as pd

df1 = pd.DataFrame({
    'City': ['San Francisco, CA', 'Oakland, CA'],
    'Val': [1, 2]
})
df2 = pd.DataFrame({
    'City': ['San Francisco-Oakland, CA', 'Salinas, CA'],
    'Geo': ['geo1', 'geo2']
})
Expected DataFrame upon join:
             City  Val   Geo
San Francisco, CA    1  geo1
      Oakland, CA    2  geo1
Merge two pandas DataFrames by matched ID number:
1. Create a first DataFrame.
2. Create a second DataFrame.
3. Select the column to be matched.
4. Merge using the merge function. Syntax: DataFrame.merge(parameters)
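The four steps above can be sketched as follows. This is a minimal illustration, not taken from the original page: the frames `left` and `right`, the `ID` column, and the sample values are all hypothetical.

```python
import pandas as pd

# Steps 1 and 2: two hypothetical frames that share an "ID" column.
left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["a", "b", "c"]})
right = pd.DataFrame({"ID": [2, 3, 4], "Score": [20, 30, 40]})

# Steps 3 and 4: match on "ID". how="left" keeps every row of `left`;
# rows with no partner in `right` get NaN in "Score".
merged = left.merge(right, on="ID", how="left")
print(merged)
```

An exact-key merge like this only works when the keys are identical in both frames, which is why the city names in the question need a fuzzy approach instead.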
You have now learned the three most important techniques for combining data in pandas: merge() for combining data on common columns or indices, .join() for combining data on a key column or an index, and concat() for combining DataFrames across rows or columns.
If you want to join on columns like you would with merge(), then you’ll need to set the columns as indices. Like merge(), .join() has a few parameters that give you more flexibility in your joins. However, with .join(), the list of parameters is relatively short: other: This is the only required parameter. It defines the other DataFrame to join.
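As a rough sketch of the difference, the snippet below joins two frames on a column by first moving that column into the index, then stacks the same frames with concat(). The sample data is hypothetical and chosen so that only one city matches exactly.

```python
import pandas as pd

# Hypothetical frames: only "Salinas, CA" appears in both.
df1 = pd.DataFrame({"City": ["Oakland, CA", "Salinas, CA"], "Val": [1, 2]})
df2 = pd.DataFrame({"City": ["Salinas, CA", "Fresno, CA"], "Geo": ["geo2", "geo3"]})

# .join() aligns on the index, so the key column is set as the index on both sides.
# The default is a left join: every row of df1 is kept.
joined = df1.set_index("City").join(df2.set_index("City"))
print(joined)

# concat() stacks the frames row-wise; columns missing from one frame become NaN.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```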
If there is a match, return the matching word as a separate column of the df DataFrame (e.g. df['matchedName']). If there are multiple matches, create a list of the matching words for the corresponding entry of df['content'].
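One possible way to implement that rule is sketched below. The word list `names`, the sample sentences, and the helper `collapse` are assumptions made for illustration; only the column names `content` and `matchedName` come from the text above.

```python
import re
import pandas as pd

# Hypothetical data: free-text entries and a list of words to look for.
df = pd.DataFrame({"content": ["alice met bob", "carol left", "nobody here"]})
names = ["alice", "bob", "carol"]

# Build one whole-word pattern from the word list and collect every hit per row.
pattern = r"\b(?:" + "|".join(map(re.escape, names)) + r")\b"
found = df["content"].str.findall(pattern)

def collapse(hits):
    """Return None for no match, the word for one match, a list for several."""
    if not hits:
        return None
    return hits[0] if len(hits) == 1 else hits

df["matchedName"] = found.map(collapse)
print(df)
```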
Update: the fuzzywuzzy project has been renamed to thefuzz and has moved to a new repository.
You can use the thefuzz package and its extractOne function:
# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy
from thefuzz import process
best_city = lambda x: process.extractOne(x, df2["City"])[2] # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values
Output:
>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1
Note: when the choices are passed as a pandas Series, extractOne returns a tuple of 3 values for the best match: the City name from df2 [0], the match score [1] and the index label [2] (the one I use).
This should do the job. It matches strings by Levenshtein distance.
pip install thefuzz[speedup]
import pandas as pd
import numpy as np
from thefuzz import process

def fuzzy_match(
    a: pd.DataFrame, b: pd.DataFrame, col: str, limit: int = 5, thresh: int = 80
):
    """use fuzzy matching to join on column"""
    # note: `thresh` is accepted but not applied anywhere below
    s = b[col].tolist()
    matches = a[col].apply(lambda x: process.extract(x, s, limit=limit))
    matches = pd.DataFrame(np.concatenate(matches), columns=["match", "score"])
    # join other columns in b to matches
    to_join = (
        pd.merge(left=b, right=matches, how="right", left_on=col, right_on="match")
        .set_index(  # create an index that represents the matching row in df a, you can drop this when `limit=1`
            np.array(
                list(
                    np.repeat(i, limit if limit < len(b) else len(b))
                    for i in range(len(a))
                )
            ).flatten()
        )
        .drop(columns=["match"])
        .astype({"score": "int16"})
    )
    print("\t the index here represents the row in dataframe a on which to join")
    print(to_join)
    res = pd.merge(
        left=a, right=to_join, left_index=True, right_index=True, suffixes=("", "_b")
    )
    # return only the highest match or you can just set the limit to 1
    # and remove this
    df = res.reset_index()
    df = df.iloc[df.groupby(by="index")["score"].idxmax()].reset_index(drop=True)
    return df.drop(columns=[f"{col}_b", "score", "index"])

def test(df):
    expected = pd.DataFrame(
        {
            "City": ["San Francisco, CA", "Oakland, CA"],
            "Val": [1, 2],
            "Geo": ["geo1", "geo1"],
        }
    )
    print(f'{"expected":-^70}')
    print(expected)
    print(f'{"res":-^70}')
    print(df)
    assert expected.equals(df)

if __name__ == "__main__":
    a = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
    b = pd.DataFrame(
        {"City": ["San Francisco-Oakland, CA", "Salinas, CA"], "Geo": ["geo1", "geo2"]}
    )
    print(f'\n\n{"fuzzy match":-^70}')
    res = fuzzy_match(a, b, col="City")
    test(res)