I have two data frames with two columns in common. I want to flag records in the 1st data frame that have a two-column match in the 2nd, where the match on the second column is "fuzzy". Neither column is a unique key; values can be repeated in both of them. Imagine, for example, that my matching columns are a city column and a date column.
Input dataframe 1 (df1):
| city | date |
|---|---|
| New York | 1/1/2024 |
| New York | 1/5/2024 |
| Chicago | 1/2/2024 |
| Chicago | 1/5/2024 |
| Houston | 1/3/2024 |
Input dataframe 2 (df2):
| city | date |
|---|---|
| New York | 1/2/2024 |
| Chicago | 1/5/2024 |
My desired output is to flag as True every record in df1 that has a record in df2 that exactly matches the city AND matches within one day of the date, with all other records flagged as False.
Output dataframe 1:
| city | date | paired |
|---|---|---|
| New York | 1/1/2024 | True |
| New York | 1/5/2024 | False |
| Chicago | 1/2/2024 | False |
| Chicago | 1/5/2024 | True |
| Houston | 1/3/2024 | False |
If I were seeking exact matches on both columns, the problem would be very straightforward: I would zip the two matching columns together and apply an isin test inside an np.where call:
import numpy as np

df1['zipcol'] = list(zip(df1.city, df1.date))
df2['zipcol'] = list(zip(df2.city, df2.date))
df1['paired'] = np.where(df1['zipcol'].isin(df2['zipcol']), True, False)
# some overhead to drop the helper 'zipcol' columns afterwards
What is the most efficient way to perform this type of comparison when one (or more) of the criteria is not exact, but instead requires the value from the 2nd data frame to fall within a range centered on the value from the 1st?
Final note: I am not worried about the possibility of duplicate matches here. If one-to-many matches do occur, they can be handled in a post-processing step later on.
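For reference, here is a brute-force baseline that expresses the requirement directly: an exact inner merge on city, followed by a filter on the date gap. This is a sketch using the sample data from the question; it scales poorly when cities repeat many times, since the merge materializes every city pairing before filtering.

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['New York', 'New York', 'Chicago', 'Chicago', 'Houston'],
                    'date': ['1/1/2024', '1/5/2024', '1/2/2024', '1/5/2024', '1/3/2024']})
df2 = pd.DataFrame({'city': ['New York', 'Chicago'],
                    'date': ['1/2/2024', '1/5/2024']})

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

# Exact-match on city, keeping df1's original row labels in an 'index' column.
merged = df1.reset_index().merge(df2, on='city', suffixes=('', '_2'))

# Keep only pairings whose dates are at most one day apart.
hits = merged.loc[(merged['date'] - merged['date_2']).abs() <= pd.Timedelta('1d'), 'index']

# A df1 row is paired if its label appears among the surviving pairings.
df1['paired'] = df1.index.isin(hits)
```

This produces the desired `[True, False, False, True, False]` flags, but the intermediate `merged` frame grows with the product of per-city counts.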
You need a merge_asof with direction='nearest' and 1 day of tolerance.
The important points are:

- convert the dates with to_datetime;
- reset_index before merging (a merge doesn't maintain the index);
- sort_values by date on both frames (merge_asof requires sorted keys);
- assign a helper column to df2 (here flag with value 1) to act as a flag for values present in df2.

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
df1['paired'] = (
pd.merge_asof(
df1.reset_index().sort_values(by='date'),
df2.sort_values(by='date').assign(flag=1),
on='date',
by='city',
direction='nearest',
tolerance=pd.Timedelta('1d')
)
.set_index('index')['flag']
.eq(1)
)
Output:
city date paired
0 New York 2024-01-01 True
1 New York 2024-01-05 False
2 Chicago 2024-01-02 False
3 Chicago 2024-01-05 True
4 Houston 2024-01-03 False
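To address the "one (or more) criteria" part of the question: merge_asof only accepts a single fuzzy on key, so with two fuzzy criteria you can fall back to an exact merge on the shared key plus boolean filtering. A sketch, where the second fuzzy column amount (matched within ±5) is a hypothetical addition to the sample data:

```python
import pandas as pd

# Hypothetical data: 'amount' is an invented second fuzzy criterion.
df1 = pd.DataFrame({'city': ['New York', 'Chicago'],
                    'date': pd.to_datetime(['1/1/2024', '1/5/2024']),
                    'amount': [100, 200]})
df2 = pd.DataFrame({'city': ['New York', 'Chicago'],
                    'date': pd.to_datetime(['1/2/2024', '1/5/2024']),
                    'amount': [103, 250]})

# Exact merge on the shared key, preserving df1's row labels.
m = df1.reset_index().merge(df2, on='city', suffixes=('', '_2'))

# Both fuzzy conditions must hold: date within 1 day AND amount within 5.
ok = ((m['date'] - m['date_2']).abs() <= pd.Timedelta('1d')) \
     & ((m['amount'] - m['amount_2']).abs() <= 5)

df1['paired'] = df1.index.isin(m.loc[ok, 'index'])
```

Here New York pairs (1 day apart, amounts differ by 3) while Chicago does not (amounts differ by 50), so the flags come out `[True, False]`.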