
Python Pandas - Minutes since last occurrence in 2 million row dataframe

As an example, I have the following dataframe:

Date                 indicator_1    indicator_2
2013-04-01 03:50:00       x             w
2013-04-01 04:00:00       y             u
2013-04-01 04:15:00       z             v
2013-04-01 04:25:00       x             w 
2013-04-01 04:25:00       z             u
2013-04-01 04:30:00       y             u
2013-04-01 04:35:00       y             w
2013-04-01 04:40:00       z             w
2013-04-01 04:40:00       x             u
2013-04-01 04:40:00       y             v
2013-04-01 04:50:00       x             w

My goal is to create two columns with the following rules:

  • The first column should give the number of minutes since the last occurrence of 'x' in the indicator_1 column.

  • The second column should give the number of minutes since the last occurrence of the pair 'y' in the indicator_1 column and 'u' in the indicator_2 column.

For rows that share the exact same timestamp, where one of them contains either 'x' (in the first case) or the pair 'y','u' (in the second case), the number of minutes should be computed with respect to the previous occurrence, not the one at the same timestamp. Therefore, the desired output should be something like:

 Date               desired_column_1   desired_column_2  indicator_1 indicator_2
2013-04-01 03:50:00         NaN                NaN          x              w
2013-04-01 04:00:00         10.0               NaN          y              u
2013-04-01 04:15:00         25.0               15.0         z              v
2013-04-01 04:25:00         35.0               25.0         x              w
2013-04-01 04:25:00         35.0               25.0         z              u
2013-04-01 04:30:00          5.0               30.0         y              u
2013-04-01 04:35:00         10.0                5.0         y              w
2013-04-01 04:40:00         15.0               10.0         z              w
2013-04-01 04:40:00         15.0               10.0         x              u
2013-04-01 04:40:00         15.0               10.0         y              v
2013-04-01 04:50:00         10.0               20.0         x              w

The main problem is that the entire dataframe has more than 2 million rows, so using a loop is too time-consuming. Is there any way to implement a vectorized approach to this problem?

The python code of the dataframe is the following:

import numpy as np
import pandas as pd

d = {'Date': ['2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
              '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
              '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
              '2013-04-01 04:40:00', '2013-04-01 04:50:00'],
     'indicator_1': ['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'],
     'indicator_2': ['w', 'u', 'v', 'w', 'u', 'u', 'w', 'w', 'u', 'v', 'w'],
     'desired_column_1': [np.nan, 10, 25, 35, 35, 5, 10, 15, 15, 15, 10],
     'desired_column_2': [np.nan, np.nan, 15, 25, 25, 30, 5, 10, 10, 10, 20]}

df = pd.DataFrame(data=d)
Asked Aug 15 '18 by Miguel Lambelho

1 Answer

First, ensure the 'Date' column is a datetime dtype, then add a column holding the difference in minutes between consecutive rows:

df.Date = pd.to_datetime(df.Date)
# .astype('timedelta64[m]') was removed in pandas 2.0; use total_seconds() instead
df['minD'] = df.Date.diff().dt.total_seconds().div(60)
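On the question's sample data, the per-row minute gaps come out as follows (a quick sketch to make the intermediate column concrete; `minD` is the answer's own name for it):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
    '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
    '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
    '2013-04-01 04:40:00', '2013-04-01 04:50:00']))

# minute gap between consecutive rows; the first row has no predecessor -> NaN,
# and rows sharing a timestamp get a gap of 0
minD = dates.diff().dt.total_seconds().div(60)
print(minD.tolist())
# [nan, 10.0, 15.0, 10.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0, 10.0]
```

The zeros at the duplicated timestamps are what the forward-fill step at the end takes care of.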

Next, create a grouping key for each condition. The cumulative sum is shifted down one row because we want the time since the *last* occurrence: without the shift, each new 'x' would open a new group at itself instead of closing out the previous one.

mask1 = (df.indicator_1 == 'x').cumsum().shift(1)
mask2 = (df.indicator_1.str.cat(df.indicator_2) == 'yu').cumsum().shift(1)
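To see what these keys look like, here they are evaluated on the sample indicators (a sketch under the question's data; every row between one occurrence and the next shares a group id, and id 0 marks rows before the first occurrence):

```python
import pandas as pd

ind1 = pd.Series(['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'])
ind2 = pd.Series(['w', 'u', 'v', 'w', 'u', 'u', 'w', 'w', 'u', 'v', 'w'])

# shifted cumulative counts: each group spans the rows after one match
# up to and including the next match
mask1 = (ind1 == 'x').cumsum().shift(1)
mask2 = (ind1.str.cat(ind2) == 'yu').cumsum().shift(1)

print(mask1.tolist())
# [nan, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0]
print(mask2.tolist())
# [nan, 0.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

Note the 0 at row 1 of `mask2`: the first 'y','u' pair has not finished a group yet, which is why the next step filters keys `> 0`.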

Now group the minute differences by these keys and take the cumulative sum within each group. We first mask out key values of 0, which mark rows before the condition has occurred at all; those rows have no previous occurrence to measure from and should stay as missing values.

df['desired_column_1'] = df.groupby(mask1.where(mask1 > 0)).minD.cumsum() 
df['desired_column_2'] = df.groupby(mask2.where(mask2 > 0)).minD.cumsum()
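On the sample data, the raw grouped cumulative sum for the first condition looks like this before the zero replacement in the next step (a sketch under the same assumptions as above; rows whose key is NaN are excluded from the groupby and stay NaN):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
        '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
        '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
        '2013-04-01 04:40:00', '2013-04-01 04:50:00']),
    'indicator_1': ['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'],
})

minD = df.Date.diff().dt.total_seconds().div(60)
mask1 = (df.indicator_1 == 'x').cumsum().shift(1)

# cumulative minutes within each between-occurrences group
col1 = minD.groupby(mask1.where(mask1 > 0)).cumsum()
print(col1.tolist())
# [nan, 10.0, 25.0, 35.0, 0.0, 5.0, 10.0, 15.0, 15.0, 0.0, 10.0]
```

The zeros at rows 4 and 9 are the rows that share a timestamp with an 'x'; the forward fill below carries the previous group total into them, giving 35.0 and 15.0 as required.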

Finally, rows that share a timestamp with an occurrence end up with 0; replace those zeros by forward-filling from the previous row:

# replace(0, method='ffill') is deprecated in pandas 2.x;
# mask the zeros and forward-fill instead
df.desired_column_1 = df.desired_column_1.mask(df.desired_column_1.eq(0)).ffill()
df.desired_column_2 = df.desired_column_2.mask(df.desired_column_2.eq(0)).ffill()
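Putting all the steps together, here is a self-contained end-to-end sketch using current pandas APIs. The zero replacement is done with `mask(...).ffill()` rather than the older `replace(0, method='ffill')`; the two are equivalent here because NaNs only ever appear before the first occurrence, so the forward fill cannot fill them with stale values:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2013-04-01 03:50:00', '2013-04-01 04:00:00', '2013-04-01 04:15:00',
        '2013-04-01 04:25:00', '2013-04-01 04:25:00', '2013-04-01 04:30:00',
        '2013-04-01 04:35:00', '2013-04-01 04:40:00', '2013-04-01 04:40:00',
        '2013-04-01 04:40:00', '2013-04-01 04:50:00']),
    'indicator_1': ['x', 'y', 'z', 'x', 'z', 'y', 'y', 'z', 'x', 'y', 'x'],
    'indicator_2': ['w', 'u', 'v', 'w', 'u', 'u', 'w', 'w', 'u', 'v', 'w'],
})

# 1. minute gaps between consecutive rows
minD = df.Date.diff().dt.total_seconds().div(60)

# 2. shifted cumulative-count grouping keys for each condition
mask1 = (df.indicator_1 == 'x').cumsum().shift(1)
mask2 = (df.indicator_1.str.cat(df.indicator_2) == 'yu').cumsum().shift(1)

# 3. cumulative minutes within each group; key 0 = condition not seen yet
col1 = minD.groupby(mask1.where(mask1 > 0)).cumsum()
col2 = minD.groupby(mask2.where(mask2 > 0)).cumsum()

# 4. zeros mark rows sharing a timestamp with an occurrence;
#    carry the previous row's value into them
df['desired_column_1'] = col1.mask(col1.eq(0)).ffill()
df['desired_column_2'] = col2.mask(col2.eq(0)).ffill()
```

Everything here is column-wise pandas work with no Python-level loop, so it scales to the 2-million-row frame in the question.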

This yields

                  Date indicator_1 indicator_2  desired_column_1  desired_column_2
0  2013-04-01 03:50:00           x           w               NaN               NaN
1  2013-04-01 04:00:00           y           u              10.0               NaN
2  2013-04-01 04:15:00           z           v              25.0              15.0
3  2013-04-01 04:25:00           x           w              35.0              25.0
4  2013-04-01 04:25:00           z           u              35.0              25.0
5  2013-04-01 04:30:00           y           u               5.0              30.0
6  2013-04-01 04:35:00           y           w              10.0               5.0
7  2013-04-01 04:40:00           z           w              15.0              10.0
8  2013-04-01 04:40:00           x           u              15.0              10.0
9  2013-04-01 04:40:00           y           v              15.0              10.0
10 2013-04-01 04:50:00           x           w              10.0              20.0
Answered Sep 28 '22 by DJK