As an example, I have the following dataframe:
Date indicator_1 indicator_2
2013-04-01 03:50:00 x w
2013-04-01 04:00:00 y u
2013-04-01 04:15:00 z v
2013-04-01 04:25:00 x w
2013-04-01 04:25:00 z u
2013-04-01 04:30:00 y u
2013-04-01 04:35:00 y w
2013-04-01 04:40:00 z w
2013-04-01 04:40:00 x u
2013-04-01 04:40:00 y v
2013-04-01 04:50:00 x w
My goal is to create two columns with the following rules:
The first column should give me the number of minutes since the last occurrence of 'x' in the indicator_1 column.
The second column should give me the number of minutes since the last occurrence of the pair 'y' in indicator_1 and 'u' in indicator_2.
For rows that share the exact same timestamp, where one of them contains 'x' (in the first case) or the pair 'y'/'u' (in the second case), the number of minutes should be computed with respect to the previous occurrence of those values. Therefore, the desired output should be something like:
Date desired_column_1 desired_column_2 indicator_1 indicator_2
2013-04-01 03:50:00 NaN NaN x w
2013-04-01 04:00:00 10.0 NaN y u
2013-04-01 04:15:00 25.0 15.0 z v
2013-04-01 04:25:00 35.0 25.0 x w
2013-04-01 04:25:00 35.0 25.0 z u
2013-04-01 04:30:00 5.0 30.0 y u
2013-04-01 04:35:00 10.0 5.0 y w
2013-04-01 04:40:00 15.0 10.0 z w
2013-04-01 04:40:00 15.0 10.0 x u
2013-04-01 04:40:00 15.0 10.0 y v
2013-04-01 04:50:00 10.0 20.0 x w
The main problem is that the entire dataframe has more than 2 million rows, therefore using a loop is too time consuming. Is there any way to implement a vectorized approach to this problem?
The Python code to build the dataframe is the following:
import numpy as np
import pandas as pd

d = {'Date': ['2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
              '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
              '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
              '2013-04-01 04:40:00', '2013-04-01 04:50:00'],
     'indicator_1': ['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'],
     'indicator_2': ['w', 'u', 'v', 'w', 'u', 'u', 'w', 'w', 'u', 'v', 'w'],
     'desired_column_1': [np.nan, 10, 25, 35, 35, 5, 10, 15, 15, 15, 10],
     'desired_column_2': [np.nan, np.nan, 15, 25, 25, 30, 5, 10, 10, 10, 20]}
df = pd.DataFrame(data=d)
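For reference, here is the kind of row-by-row loop the question wants to avoid. This is an illustrative sketch (`minutes_since` is a made-up helper name), which implements the tie-breaking rule as "minutes since the most recent match at a strictly earlier timestamp":

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime([
    '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
    '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
    '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
    '2013-04-01 04:40:00', '2013-04-01 04:50:00'])
df = pd.DataFrame({'Date': dates,
                   'indicator_1': list('xyzxzyyzxyx'),
                   'indicator_2': list('wuvwuuwwuvw')})

def minutes_since(dates, hits):
    """Minutes since the most recent hit at a strictly earlier timestamp."""
    out, seen = [], []           # seen: timestamps of hits so far
    for t, hit in zip(dates, hits):
        prior = [s for s in seen if s < t]
        out.append((t - prior[-1]).total_seconds() / 60 if prior else np.nan)
        if hit:
            seen.append(t)
    return out

df['desired_column_1'] = minutes_since(df.Date, df.indicator_1.eq('x'))
df['desired_column_2'] = minutes_since(
    df.Date, df.indicator_1.eq('y') & df.indicator_2.eq('u'))
```

This reproduces the desired columns on the sample, but the Python-level loop (and the list scan inside it) is what becomes prohibitively slow at 2 million rows.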
First, ensure the 'Date' column is a datetime and add a column holding the minute difference from row to row:
df.Date = pd.to_datetime(df.Date)
# minutes between consecutive rows (the timedelta64[m] cast is gone in pandas 2.x)
df['minD'] = df.Date.diff().dt.total_seconds().div(60)
Next, create a grouping key for each condition. We shift down one row because we want the time since the last 'x', and the elapsed time should run up to and including the row where the next 'x' occurs; without the shift, that row would start a new group instead.
mask2 = (df.indicator_1.str.cat(df.indicator_2) == 'yu').cumsum().shift(1)
mask1 = (df.indicator_1 == 'x').cumsum().shift(1)
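To make the effect of the shift concrete, here is `mask1` evaluated on just the sample `indicator_1` values (a quick illustration, not part of the pipeline):

```python
import pandas as pd

ind1 = pd.Series(list('xyzxzyyzxyx'))       # indicator_1 from the sample frame
mask1 = (ind1 == 'x').cumsum().shift(1)
print(mask1.tolist())
# [nan, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0]
# Each 'x' row keeps the previous group's key, so the running sum of minute
# differences includes the step up to (and including) the next 'x'.
```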
Now groupby the masks and cumsum() the minute differences. We also need to filter out positions where the cumulative count is still 0, since the condition has not happened yet and those rows should keep a missing value for the time difference.
df['desired_column_1'] = df.groupby(mask1.where(mask1 > 0)).minD.cumsum()
df['desired_column_2'] = df.groupby(mask2.where(mask2 > 0)).minD.cumsum()
Rows that share a timestamp with the previous occurrence produce 0 values in those columns; replace them by forward filling the data:
# replace(0, method='ffill') is deprecated; replace with NaN, then forward fill
df.desired_column_1 = df.desired_column_1.replace(0, np.nan).ffill()
df.desired_column_2 = df.desired_column_2.replace(0, np.nan).ffill()
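Putting the steps together, the whole pipeline can be checked against the desired columns from the question. This consolidated sketch uses `diff()`/`total_seconds()` in place of the timedelta cast, which newer pandas no longer accepts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
        '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
        '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
        '2013-04-01 04:40:00', '2013-04-01 04:50:00']),
    'indicator_1': list('xyzxzyyzxyx'),
    'indicator_2': list('wuvwuuwwuvw')})

# minute difference between consecutive rows
df['minD'] = df.Date.diff().dt.total_seconds().div(60)

# grouping keys: shift so each matching row still belongs to the previous group
mask1 = (df.indicator_1 == 'x').cumsum().shift(1)
mask2 = (df.indicator_1.str.cat(df.indicator_2) == 'yu').cumsum().shift(1)

# accumulate minutes within each group; rows before the first match stay NaN
df['desired_column_1'] = df.groupby(mask1.where(mask1 > 0)).minD.cumsum()
df['desired_column_2'] = df.groupby(mask2.where(mask2 > 0)).minD.cumsum()

# duplicate timestamps yield zeros; forward fill them from the previous row
for c in ['desired_column_1', 'desired_column_2']:
    df[c] = df[c].replace(0, np.nan).ffill()

expected_1 = [np.nan, 10, 25, 35, 35, 5, 10, 15, 15, 15, 10]
expected_2 = [np.nan, np.nan, 15, 25, 25, 30, 5, 10, 10, 10, 20]
assert np.allclose(df.desired_column_1, expected_1, equal_nan=True)
assert np.allclose(df.desired_column_2, expected_2, equal_nan=True)
```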
This yields:
Date indicator_1 indicator_2 desired_column_1 \
0 2013-04-01 03:50:00 x w NaN
1 2013-04-01 04:00:00 y u 10.0
2 2013-04-01 04:15:00 z v 25.0
3 2013-04-01 04:25:00 x w 35.0
4 2013-04-01 04:25:00 z u 35.0
5 2013-04-01 04:30:00 y u 5.0
6 2013-04-01 04:35:00 y w 10.0
7 2013-04-01 04:40:00 z w 15.0
8 2013-04-01 04:40:00 x u 15.0
9 2013-04-01 04:40:00 y v 15.0
10 2013-04-01 04:50:00 x w 10.0
desired_column_2
0 NaN
1 NaN
2 15.0
3 25.0
4 25.0
5 30.0
6 5.0
7 10.0
8 10.0
9 10.0
10 20.0