
Pandas: Merge on exact ID and closest date

I'm trying to merge two Pandas dataframes on two columns. One column holds a unique identifier that could be matched exactly with a plain .merge(). The second column, however, would need merge_asof(), because it has to match on the closest date rather than an exact date.

There is a similar question here: Pandas Merge on Name and Closest Date, but it was asked and answered nearly three years ago, and merge_asof() is a much newer addition.

I asked a similar question here a couple of months ago, but that solution only needed merge_asof(), with no exact matches required.

In the interest of including some code, it would look something like this:

df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])

where the IDs match exactly, but the date_time values are "near matches".

Any help is greatly appreciated.

elPastor asked Oct 18 '22 17:10


1 Answer

Consider merging on the ID first, then running DataFrame.apply to find, for each row, the latest date_time from the first dataframe (among rows with the same ID) that is earlier than that row's date_time from the second dataframe.

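import pandas as pd

# INITIAL MERGE (ALL df1 x df2 ROW PAIRINGS THAT SHARE AN ID)
mdf = pd.merge(df1, df2, on=['ID'])

def f(row):
    # Latest df1 date_time (date_time_x) for this ID that is strictly
    # earlier than the current row's df2 date_time (date_time_y)
    earlier = mdf[(mdf['ID'] == row['ID']) &
                  (mdf['date_time_x'] < row['date_time_y'])]
    return earlier['date_time_x'].max()

# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] == mdf.apply(f, axis=1)].reset_index(drop=True)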
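To make the behaviour concrete, here is a minimal sketch that runs the same merge-then-filter idea on a tiny pair of frames. The sample rows and the val1 / val2 columns are made up for illustration; only ID and date_time come from the question.

```python
import pandas as pd

# Hypothetical sample data (not from the question); ID and date_time
# follow the question's column names, val1/val2 are illustrative.
df1 = pd.DataFrame({
    "ID": [1, 1, 2],
    "date_time": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-03"]),
    "val1": ["a", "b", "c"],
})
df2 = pd.DataFrame({
    "ID": [1, 2],
    "date_time": pd.to_datetime(["2017-01-04", "2017-01-10"]),
    "val2": ["x", "y"],
})

# Every df1 row paired with every df2 row that shares its ID.
# Overlapping column names get the default _x (df1) / _y (df2) suffixes.
mdf = pd.merge(df1, df2, on="ID")

def latest_earlier(row):
    # Latest df1 timestamp for this ID strictly before the df2 timestamp.
    earlier = mdf[(mdf["ID"] == row["ID"]) &
                  (mdf["date_time_x"] < row["date_time_y"])]
    return earlier["date_time_x"].max()

# Keep only the pairing whose df1 timestamp equals that conditional max.
result = mdf[mdf["date_time_x"] == mdf.apply(latest_earlier, axis=1)]
result = result.reset_index(drop=True)
print(result)
```

With this data, the df2 row for ID 1 (2017-01-04) is paired with the df1 row dated 2017-01-01, because 2017-01-05 is later than the df2 timestamp and is filtered out.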

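This approach is driven by df2 (similar to a right join): each df2 row is paired with the latest earlier df1 row that has the same ID. Note that a df2 row with no earlier df1 date_time for its ID is dropped, because comparing against a missing maximum evaluates to False. Flip the _x / _y suffixes to drive the match from df1 instead (similar to a left join).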
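Since the question asks about merge_asof() specifically, it may be worth noting that pandas.merge_asof accepts a by argument for columns that must match exactly, alongside on for the nearest-key column. The sketch below is one possible way to use it, not a drop-in for the code above: the sample rows, the val1 / val2 columns, and the matched_date_time helper column are assumptions for illustration. Both frames must be sorted by the on key.

```python
import pandas as pd

# Hypothetical sample data (not from the question).
df1 = pd.DataFrame({
    "ID": [1, 1, 2],
    "date_time": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-03"]),
    "val1": ["a", "b", "c"],
})
df2 = pd.DataFrame({
    "ID": [1, 2],
    "date_time": pd.to_datetime(["2017-01-04", "2017-01-10"]),
    "val2": ["x", "y"],
})

# merge_asof requires both inputs to be sorted by the "on" column.
left = df2.sort_values("date_time")
# Copy df1's timestamp into a helper column (illustrative name) so the
# matched df1 time is still visible after the merge.
right = df1.sort_values("date_time").assign(
    matched_date_time=lambda d: d["date_time"]
)

# Exact match on ID (by=), closest match on date_time in either direction.
nearest = pd.merge_asof(left, right, on="date_time", by="ID",
                        direction="nearest")

# The default direction="backward" takes the latest df1 row at or before
# each df2 timestamp; allow_exact_matches=False makes that strictly before.
backward = pd.merge_asof(left, right, on="date_time", by="ID",
                         allow_exact_matches=False)
print(nearest)
print(backward)
```

In this sample, direction="nearest" pairs the df2 row for ID 1 (2017-01-04) with the df1 row dated 2017-01-05, while the backward variant pairs it with 2017-01-01. Unlike the filter above, merge_asof keeps every row of the left frame and fills unmatched rows with NaN / NaT.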

Parfait answered Oct 20 '22 09:10