I want to calculate diff
by group. And I don’t know how to sort the time
column so that each group results are sorted and positive.
The original data :
In [37]: df Out[37]: id time 0 A 2016-11-25 16:32:17 1 A 2016-11-25 16:36:04 2 A 2016-11-25 16:35:29 3 B 2016-11-25 16:35:24 4 B 2016-11-25 16:35:46
The result I want
Out[40]: id time 0 A 00:35 1 A 03:12 2 B 00:22
notice: the type of time col is timedelta64[ns]
In [38]: df['time'].diff(1) Out[38]: 0 NaT 1 00:03:47 2 -1 days +23:59:25 3 -1 days +23:59:55 4 00:00:22 Name: time, dtype: timedelta64[ns]
Don't get desired result.
Not only solve the problem but the code can run fast because there are 50 million rows.
To calculate time difference between two Python Pandas columns in hours and minutes, we can subtract the datetime objects directly. We create a Panda DataFrame with 3 columns. Then we set the values of the to and fr columns to Pandas timestamps.
You can use sort_values
with groupby
and aggregating diff
:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff() print (df) id time diff 0 A 2016-11-25 16:32:17 NaT 1 A 2016-11-25 16:36:04 00:00:35 2 A 2016-11-25 16:35:29 00:03:12 3 B 2016-11-25 16:35:24 NaT 4 B 2016-11-25 16:35:46 00:00:22
If need remove rows with NaT
in column diff
use dropna
:
df = df.dropna(subset=['diff']) print (df) id time diff 2 A 2016-11-25 16:35:29 00:03:12 1 A 2016-11-25 16:36:04 00:00:35 4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff() print (df) id time 0 A NaT 1 A 00:00:35 2 A 00:03:12 3 B NaT 4 B 00:00:22
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff() df = df.dropna(subset=['time']) print (df) id time 1 A 00:00:35 2 A 00:03:12 4 B 00:00:22
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With