Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating row-wise time difference in python

I want to calculate the travel time of each passengers in my data frame based on the difference between the moment where they first get in the bus and the moment they leave.

Here is the data frame

my_df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c','d'],
    'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05','2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
    'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24', '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
    'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14', '19:01:07', '19:04:08', '17:09:48', '21:50:12']
})
print(my_df)

output:

    id  date    arriving_time   leaving_time
0   a   2020/02/03  14:36:06    14:40:05
1   b   2020/04/05  08:52:02    08:52:41
2   b   2020/04/05  08:53:02    08:54:33
3   b   2020/04/05  08:55:24    08:57:14
4   b   2020/04/06  18:58:03    19:01:07
5   b   2020/04/06  19:03:05    19:04:08
6   c   2020/12/15  17:04:28    17:09:48
7   d   2020/06/23  21:31:23    21:50:12

However there is two problems (that I don't manage to solve myself):

  • passengers are detected via their phone signal but the signal is often unstable, this is why for a same person, we can have many rows (like the passenger b in the above data set). "arriving_time" is the time where the signal is detected and "leaving_time" the time where the signal is lost
  • To compute the travel time, I need to substract, for each unique ID and for each travel, the least recent arriving_time to the most recent leaving time.

Here is the result I want to obtain

id  date    arriving_time   leaving_time    travelTime
0   a   2020/02/03  14:36:06    14:40:05    00:03:59
1   b   2020/04/05  08:52:02    08:52:41    00:05:12
2   b   2020/04/05  08:53:02    08:54:33    00:05:12
3   b   2020/04/05  08:55:24    08:57:14    00:05:12
4   b   2020/04/06  18:58:03    19:01:07    00:06:05
5   b   2020/04/06  19:03:05    19:04:08    00:06:05
6   c   2020/12/15  17:04:28    17:09:48    00:05:20
7   d   2020/06/23  21:31:23    21:50:12    00:18:49

As you can see, passenger b made two different travel on the same day, and I want to know compute how long each one of them last.

I already tried the following code, which seems to work, but it is really slow (which I think is due to the large amount of rows of my_df)

for user_id in set(my_df.id):
    for day in set(my_df.loc[my_df.id == user_id, 'date']):
        my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'travelTime'] = max(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'leaving_time'].apply(pd.to_datetime)) - min(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'arriving_time'].apply(pd.to_datetime))
like image 727
Le Noff Avatar asked Mar 19 '20 11:03

Le Noff


3 Answers

I think for correct maximal and minimal values are converted columns to datetimes and then subtract Series created by GroupBy.transform:

my_df['s'] = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
my_df['e'] = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])

g = my_df.groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
  id        date arriving_time leaving_time                   s  \
0  a  2020/02/03      14:36:06     14:40:05 2020-02-03 14:36:06   
1  b  2020/04/05      08:52:02     08:52:41 2020-04-05 08:52:02   
2  b  2020/04/05      08:53:02     08:54:33 2020-04-05 08:53:02   
3  b  2020/04/05      08:55:24     08:57:14 2020-04-05 08:55:24   
4  b  2020/04/06      18:58:03     19:01:07 2020-04-06 18:58:03   
5  b  2020/04/06      19:03:05     19:04:08 2020-04-06 19:03:05   
6  c  2020/12/15      17:04:28     17:09:48 2020-12-15 17:04:28   
7  d  2020/06/23      21:31:23     21:50:12 2020-06-23 21:31:23   

                    e travelTime  
0 2020-02-03 14:40:05   00:03:59  
1 2020-04-05 08:52:41   00:05:12  
2 2020-04-05 08:54:33   00:05:12  
3 2020-04-05 08:57:14   00:05:12  
4 2020-04-06 19:01:07   00:06:05  
5 2020-04-06 19:04:08   00:06:05  
6 2020-12-15 17:09:48   00:05:20  
7 2020-06-23 21:50:12   00:18:49  

For avoid new columns is possible use DataFrame.assign Series with datetimes:

s = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
e = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])

g = my_df.assign(s=s, e=e).groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
  id        date arriving_time leaving_time travelTime
0  a  2020/02/03      14:36:06     14:40:05   00:03:59
1  b  2020/04/05      08:52:02     08:52:41   00:05:12
2  b  2020/04/05      08:53:02     08:54:33   00:05:12
3  b  2020/04/05      08:55:24     08:57:14   00:05:12
4  b  2020/04/06      18:58:03     19:01:07   00:06:05
5  b  2020/04/06      19:03:05     19:04:08   00:06:05
6  c  2020/12/15      17:04:28     17:09:48   00:05:20
7  d  2020/06/23      21:31:23     21:50:12   00:18:49
like image 114
jezrael Avatar answered Oct 20 '22 14:10

jezrael


IIUC we first groupby id & date to get the max and min leave & arrival time.

then a simple subtraction.

df2 = df.groupby(['id','date']).agg(min_arrival=('arriving_time','min'),
                             max_leave=('leaving_time','max'))


df2['travelTime'] =  pd.to_datetime(df2['max_leave']) - pd.to_datetime(df2['min_arrival']) 


print(df2)

              min_arrival max_leave travelTime
id date                                       
a  2020-02-03    14:36:06  14:40:05   00:03:59
b  2020-04-05    08:52:02  08:57:14   00:05:12
   2020-04-06    18:58:03  19:04:08   00:06:05
c  2020-12-15    17:04:28  17:09:48   00:05:20
d  2020-06-23    21:31:23  21:50:12   00:18:49

if you want this back on yout original df, you could use transform or merge the values from the new delta onto your original :

df_new = (pd.merge(df,df2[['travelTime']],on=['date','id'],how='left')

  id       date arriving_time leaving_time   travelTime
0  a 2020-02-03      14:36:06     14:40:05     00:03:59
1  b 2020-04-05      08:52:02     08:52:41     00:05:12
2  b 2020-04-05      08:53:02     08:54:33     00:05:12
3  b 2020-04-05      08:55:24     08:57:14     00:05:12
4  b 2020-04-06      18:58:03     19:01:07     00:06:05
5  b 2020-04-06      19:03:05     19:04:08     00:06:05
6  c 2020-12-15      17:04:28     17:09:48     00:05:20
7  d 2020-06-23      21:31:23     21:50:12     00:18:49
like image 32
Umar.H Avatar answered Oct 20 '22 16:10

Umar.H


You could try this -

my_df['arriving_time'] = pd.to_datetime(my_df['arriving_time'])
my_df['leaving_time'] = pd.to_datetime(my_df['leaving_time'])
my_df['travel_time'] = my_df.groupby(['id', 'date'])['leaving_time'].transform('max') - my_df.groupby(['id', 'date'])['arriving_time'].transform('min')
my_df
    id        date       arriving_time        leaving_time travel_time
0  a  2020/02/03 2020-03-19 14:36:06 2020-03-19 14:40:05    00:03:59
1  b  2020/04/05 2020-03-19 08:52:02 2020-03-19 08:52:41    00:05:12
2  b  2020/04/05 2020-03-19 08:53:02 2020-03-19 08:54:33    00:05:12
3  b  2020/04/05 2020-03-19 08:55:24 2020-03-19 08:57:14    00:05:12
4  b  2020/04/06 2020-03-19 18:58:03 2020-03-19 19:01:07    00:06:05
5  b  2020/04/06 2020-03-19 19:03:05 2020-03-19 19:04:08    00:06:05
6  c  2020/12/15 2020-03-19 17:04:28 2020-03-19 17:09:48    00:05:20
7  d  2020/06/23 2020-03-19 21:31:23 2020-03-19 21:50:12    00:18:49
like image 2
Sajan Avatar answered Oct 20 '22 14:10

Sajan