Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query for one dataframe row based on row in another dataframe & compare values

So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.

df1 = base records

df2 = simulation records

Part 1: What I am trying to accomplish is to query df1 'base records' to find the row that has the most recent timestamp to that in the df2 'simulation records' where the "Name" & "Time" columns match exactly.

Part 2: Then I want to use an if then function to determine whether a value in the simulation record row fall between a range created using two values from the base record row and return a boolean.

low range = df1['Po']-df1['Ref']

high range = df1['Po']+df1['Ref']

if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score" otherwise return false

Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.

helpful information:

  • df1 (base records) have more or less rows than df2 (simulation records)
  • df1 has more columns than df2
  • some columns in df1 have the same name but different values in df2
  • ideally want to be able to slice both dataframes where the if then function only sees the two rows used in the comparison
  • only need the most recent df1 base record to compare to the df2 simulation record
  • previously accomplished this in google sheets with if then & query combination formula dragged down entire sheet (want to replace with python & pandas)
df1 base records example (columns that matter)

Timestamp            Name     Time     Po  Ref

7/11/2022 11:30:00   trial   20 mins   5   2

7/10/2022 04:00:00   trial   20 mins   4   4

7/09/2022 02:45:00   trial   20 mins   2   2

6/28/2022 03:45:00   trial   20 mins   3   6
df2 simulation records example (columns that matter)

Timestamp             Name     Time     Sim

7/10/2022 05:15:00    trial   20 mins   7

7/11/2022 12:45:00    trial   20 mins   4

7/12/2022 03:30:00    trial   20 mins   8
desired result of new column added to df2

Timestamp             Name     Time     Sim  Sim Score

7/10/2022 05:15:00    trial   20 mins   7    True

7/11/2022 12:45:00    trial   20 mins   4    True

7/12/2022 03:30:00    trial   20 mins   8    False
like image 460
Chloe Avatar asked Feb 19 '26 23:02

Chloe


1 Answers

Use pandas.DataFrame.reindex, its method offer nearest to find the computable index(e.g., string can not calculate distance)

Or use merge_asof, its direction offer nearest.





Method 1:

reindex() with method='nearest'

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
                      Name     Time  Po  Ref  l_r  h_r
Timestamp                                             
2022-07-11 11:30:00  trial  20 mins   5    2    3    7
2022-07-10 04:00:00  trial  20 mins   4    4    0    8
2022-07-09 02:45:00  trial  20 mins   2    2    0    4
2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
                      Name     Time  Sim
Timestamp                               
2022-07-10 05:15:00  trial  20 mins    7
2022-07-11 12:45:00  trial  20 mins    4
2022-07-12 03:30:00  trial  20 mins    8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)

enter image description here

As you can see, this is df2.join(df1),

join multiple DataFrame objects by index at once.

with method='nearest', in this case, it would join df2 and df1 by the nearest Timestamp index.


df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False




Method 2:

merge_asof() with direction='nearest' this way is not executed with indexed value, therefore we don't have to set column Timestamp as index. But it needs binding objects(in this case we merge on column Timestamp)sorted.

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
            Timestamp   Name     Time  Po  Ref  l_r  h_r
3 2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
2 2022-07-09 02:45:00  trial  20 mins   2    2    0    4
1 2022-07-10 04:00:00  trial  20 mins   4    4    0    8
0 2022-07-11 11:30:00  trial  20 mins   5    2    3    7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
            Timestamp   Name     Time  Sim
0 2022-07-10 05:15:00  trial  20 mins    7
1 2022-07-11 12:45:00  trial  20 mins    4
2 2022-07-12 03:30:00  trial  20 mins    8

temp = pd.merge_asof(df2 ,df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)

enter image description here As you can see, this is pd.merge_asof(df2, df1),

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False

Frankly speaking, working on indexed things would be faster if you have a large dataset.




Method 2 (on multiple keys)

I remodified df1 adding different Name and Time

df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
                    'Name':['trial','trial','trial','non-trial','trial','trial'],
                    'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
                    'Po':[5, 6, 4, 1, 2, 3],
                    'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
            Timestamp       Name     Time  Po  Ref  l_r  h_r
5 2022-06-28 03:45:00      trial  20 mins   3    6   -3    9
4 2022-07-09 02:45:00      trial  20 mins   2    2    0    4
2 2022-07-10 04:00:00      trial  20 mins   4    4    0    8
3 2022-07-10 04:00:00  non-trial  20 mins   1    3   -2    4
0 2022-07-11 11:30:00      trial  20 mins   5    2    3    7
1 2022-07-11 11:30:00      trial  30 mins   6    2    4    8
print(df2)
###
            Timestamp   Name     Time  Sim
0 2022-07-10 05:15:00  trial  20 mins    7
1 2022-07-11 12:45:00  trial  20 mins    4
2 2022-07-12 03:30:00  trial  20 mins    8

Important:

can only merge_asof on a single key, therefore others would utilize by= to deal with.

temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)

enter image description here

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
            Timestamp   Name     Time  Sim  Sim Score
0 2022-07-10 05:15:00  trial  20 mins    7       True
1 2022-07-11 12:45:00  trial  20 mins    4       True
2 2022-07-12 03:30:00  trial  20 mins    8      False

Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept

like image 74
Baron Legendre Avatar answered Feb 21 '26 12:02

Baron Legendre