
How to join two dataframes for which column values are within a certain range?

Given two dataframes df_1 and df_2, how can I join them such that the datetime column timestamp of df_1 falls between start and end in df_2?

    print(df_1)

                 timestamp         A         B
    0  2016-05-14 10:54:33  0.020228  0.026572
    1  2016-05-14 10:54:34  0.057780  0.175499
    2  2016-05-14 10:54:35  0.098808  0.620986
    3  2016-05-14 10:54:36  0.158789  1.014819
    4  2016-05-14 10:54:39  0.038129  2.384590

    print(df_2)

                     start                 end event
    0  2016-05-14 10:54:31 2016-05-14 10:54:33    E1
    1  2016-05-14 10:54:34 2016-05-14 10:54:37    E2
    2  2016-05-14 10:54:38 2016-05-14 10:54:42    E3

The goal is to get the corresponding event for each row where df_1.timestamp is between df_2.start and df_2.end:

                 timestamp         A         B event
    0  2016-05-14 10:54:33  0.020228  0.026572    E1
    1  2016-05-14 10:54:34  0.057780  0.175499    E2
    2  2016-05-14 10:54:35  0.098808  0.620986    E2
    3  2016-05-14 10:54:36  0.158789  1.014819    E2
    4  2016-05-14 10:54:39  0.038129  2.384590    E3
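For anyone who wants to reproduce this, the two frames can be rebuilt roughly as follows. This is only a sketch: the values are copied from the printouts in the question, and the assumption that timestamp, start and end are real datetime columns (rather than strings) is mine.

```python
import pandas as pd

# Sample data copied from the question; datetime dtype is an assumption.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:35",
        "2016-05-14 10:54:36",
        "2016-05-14 10:54:39",
    ]),
    "A": [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    "B": [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})

df_2 = pd.DataFrame({
    "start": pd.to_datetime([
        "2016-05-14 10:54:31",
        "2016-05-14 10:54:34",
        "2016-05-14 10:54:38",
    ]),
    "end": pd.to_datetime([
        "2016-05-14 10:54:33",
        "2016-05-14 10:54:37",
        "2016-05-14 10:54:42",
    ]),
    "event": ["E1", "E2", "E3"],
})
```

If the columns arrive as strings (for example from a CSV), converting them with pd.to_datetime first keeps the interval lookups comparing like with like.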
DougKruger asked Oct 02 '17 12:10


2 Answers

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetime columns have a datetime dtype rather than being strings).

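    import pandas as pd

    # Index df_2 by its closed [start, end] intervals
    df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
    # For each timestamp, find the interval that contains it and take that row's event
    df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])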

Output:

                 timestamp         A         B event
    0  2016-05-14 10:54:33  0.020228  0.026572    E1
    1  2016-05-14 10:54:34  0.057780  0.175499    E2
    2  2016-05-14 10:54:35  0.098808  0.620986    E2
    3  2016-05-14 10:54:36  0.158789  1.014819    E2
    4  2016-05-14 10:54:39  0.038129  2.384590    E3
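Why closed='both' matters here can be seen with a single interval. The sketch below uses the E2 boundaries from the question; the variable names are only for illustration. A timestamp equal to the start of an interval counts as inside it when both ends are closed, but not with the pandas default closed='right'.

```python
import pandas as pd

# Boundaries of event E2 from the question
start = pd.Timestamp("2016-05-14 10:54:34")
end = pd.Timestamp("2016-05-14 10:54:37")

both = pd.Interval(start, end, closed="both")
right = pd.Interval(start, end, closed="right")  # the pandas default

start_in_both = start in both    # both endpoints belong to the interval
start_in_right = start in right  # the left endpoint is excluded
end_in_right = end in right      # the right endpoint is still included
```

With closed='right', the row at 10:54:34 in df_1 would not fall in any interval, so the lookup for it would fail instead of returning E2.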
Bharath answered Oct 12 '22 03:10



First use IntervalIndex to build a reference index from the intervals of interest, then use get_indexer to find, for each timestamp, the position of the row in df_2 that holds the matching event.

    import pandas as pd

    idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
    # get_indexer returns row positions; .loc works here because df_2 has the default 0..n-1 index
    event = df_2.loc[idx.get_indexer(df_1.timestamp), 'event']

    event
    0    E1
    1    E2
    1    E2
    1    E2
    2    E3
    Name: event, dtype: object

    df_1['event'] = event.to_numpy()
    df_1
                 timestamp         A         B event
    0  2016-05-14 10:54:33  0.020228  0.026572    E1
    1  2016-05-14 10:54:34  0.057780  0.175499    E2
    2  2016-05-14 10:54:35  0.098808  0.620986    E2
    3  2016-05-14 10:54:36  0.158789  1.014819    E2
    4  2016-05-14 10:54:39  0.038129  2.384590    E3

Reference: A question on IntervalIndex.get_indexer.
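One case the sample data does not exercise is a timestamp that falls in no interval. get_indexer marks such values with -1, which should not be passed on as a row position. The following is a minimal sketch of one way to handle that, not part of the original answer: the series ts and its second value are made up for illustration, and leaving unmatched rows as None is just one possible choice.

```python
import numpy as np
import pandas as pd

# Intervals copied from the question
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Hypothetical timestamps: the second one lies outside every interval
ts = pd.Series(pd.to_datetime(["2016-05-14 10:54:35", "2016-05-14 10:54:50"]))

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
pos = idx.get_indexer(ts)  # -1 means "no interval contains this value"

events = df_2["event"].to_numpy()
# Keep the event only where a match exists; otherwise leave None
matched = np.where(pos >= 0, events[pos], None)
```

Here matched can be assigned to a new column in the same way as event.to_numpy() above. Note that this relies on the intervals not overlapping, as in the question.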

cs95 answered Oct 12 '22 04:10
