Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate time difference by group using pandas?

Problem

I want to calculate diff by group. And I don’t know how to sort the time column so that each group results are sorted and positive.

The original data :

In [37]: df  Out[37]:   id                time 0  A 2016-11-25 16:32:17 1  A 2016-11-25 16:36:04 2  A 2016-11-25 16:35:29 3  B 2016-11-25 16:35:24 4  B 2016-11-25 16:35:46 

The result I want

Out[40]:    id   time 0  A   00:35 1  A   03:12 2  B   00:22 

notice: the type of time col is timedelta64[ns]

Trying

In [38]: df['time'].diff(1) Out[38]: 0                 NaT 1            00:03:47 2   -1 days +23:59:25 3   -1 days +23:59:55 4            00:00:22 Name: time, dtype: timedelta64[ns] 

Don't get desired result.

Hope

Not only solve the problem but the code can run fast because there are 50 million rows.

like image 310
Jack Avatar asked Nov 25 '16 11:11

Jack


People also ask

How do I calculate time difference between two columns in pandas?

To calculate time difference between two Python Pandas columns in hours and minutes, we can subtract the datetime objects directly. We create a Panda DataFrame with 3 columns. Then we set the values of the to and fr columns to Pandas timestamps.


1 Answers

You can use sort_values with groupby and aggregating diff:

df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff() print (df)   id                time     diff 0  A 2016-11-25 16:32:17      NaT 1  A 2016-11-25 16:36:04 00:00:35 2  A 2016-11-25 16:35:29 00:03:12 3  B 2016-11-25 16:35:24      NaT 4  B 2016-11-25 16:35:46 00:00:22 

If need remove rows with NaT in column diff use dropna:

df = df.dropna(subset=['diff']) print (df)   id                time     diff 2  A 2016-11-25 16:35:29 00:03:12 1  A 2016-11-25 16:36:04 00:00:35 4  B 2016-11-25 16:35:46 00:00:22 

You can also overwrite column:

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff() print (df)   id     time 0  A      NaT 1  A 00:00:35 2  A 00:03:12 3  B      NaT 4  B 00:00:22 

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff() df = df.dropna(subset=['time']) print (df)   id     time 1  A 00:00:35 2  A 00:03:12 4  B 00:00:22 
like image 121
jezrael Avatar answered Sep 17 '22 21:09

jezrael