I want to calculate the travel time of each passengers in my data frame based on the difference between the moment where they first get in the bus and the moment they leave.
Here is the data frame
my_df = pd.DataFrame({
'id': ['a', 'b', 'b', 'b', 'b', 'b', 'c','d'],
'date': ['2020/02/03', '2020/04/05', '2020/04/05', '2020/04/05','2020/04/06', '2020/04/06', '2020/12/15', '2020/06/23'],
'arriving_time': ['14:36:06', '08:52:02', '08:53:02', '08:55:24', '18:58:03', '19:03:05', '17:04:28', '21:31:23'],
'leaving_time': ['14:40:05', '08:52:41', '08:54:33', '08:57:14', '19:01:07', '19:04:08', '17:09:48', '21:50:12']
})
print(my_df)
output:
id date arriving_time leaving_time
0 a 2020/02/03 14:36:06 14:40:05
1 b 2020/04/05 08:52:02 08:52:41
2 b 2020/04/05 08:53:02 08:54:33
3 b 2020/04/05 08:55:24 08:57:14
4 b 2020/04/06 18:58:03 19:01:07
5 b 2020/04/06 19:03:05 19:04:08
6 c 2020/12/15 17:04:28 17:09:48
7 d 2020/06/23 21:31:23 21:50:12
However there is two problems (that I don't manage to solve myself):
Here is the result I want to obtain
id date arriving_time leaving_time travelTime
0 a 2020/02/03 14:36:06 14:40:05 00:03:59
1 b 2020/04/05 08:52:02 08:52:41 00:05:12
2 b 2020/04/05 08:53:02 08:54:33 00:05:12
3 b 2020/04/05 08:55:24 08:57:14 00:05:12
4 b 2020/04/06 18:58:03 19:01:07 00:06:05
5 b 2020/04/06 19:03:05 19:04:08 00:06:05
6 c 2020/12/15 17:04:28 17:09:48 00:05:20
7 d 2020/06/23 21:31:23 21:50:12 00:18:49
As you can see, passenger b made two different travel on the same day, and I want to know compute how long each one of them last.
I already tried the following code, which seems to work, but it is really slow (which I think is due to the large amount of rows of my_df)
for user_id in set(my_df.id):
for day in set(my_df.loc[my_df.id == user_id, 'date']):
my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'travelTime'] = max(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'leaving_time'].apply(pd.to_datetime)) - min(my_df.loc[(my_df.id == user_id) & (my_df.date == day), 'arriving_time'].apply(pd.to_datetime))
I think for correct maximal and minimal values are converted columns to datetimes and then subtract Series
created by GroupBy.transform
:
my_df['s'] = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
my_df['e'] = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])
g = my_df.groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
id date arriving_time leaving_time s \
0 a 2020/02/03 14:36:06 14:40:05 2020-02-03 14:36:06
1 b 2020/04/05 08:52:02 08:52:41 2020-04-05 08:52:02
2 b 2020/04/05 08:53:02 08:54:33 2020-04-05 08:53:02
3 b 2020/04/05 08:55:24 08:57:14 2020-04-05 08:55:24
4 b 2020/04/06 18:58:03 19:01:07 2020-04-06 18:58:03
5 b 2020/04/06 19:03:05 19:04:08 2020-04-06 19:03:05
6 c 2020/12/15 17:04:28 17:09:48 2020-12-15 17:04:28
7 d 2020/06/23 21:31:23 21:50:12 2020-06-23 21:31:23
e travelTime
0 2020-02-03 14:40:05 00:03:59
1 2020-04-05 08:52:41 00:05:12
2 2020-04-05 08:54:33 00:05:12
3 2020-04-05 08:57:14 00:05:12
4 2020-04-06 19:01:07 00:06:05
5 2020-04-06 19:04:08 00:06:05
6 2020-12-15 17:09:48 00:05:20
7 2020-06-23 21:50:12 00:18:49
For avoid new columns is possible use DataFrame.assign
Series with datetimes
:
s = pd.to_datetime(my_df['date'] + ' ' + my_df['arriving_time'])
e = pd.to_datetime(my_df['date'] + ' ' + my_df['leaving_time'])
g = my_df.assign(s=s, e=e).groupby(['id', 'date'])
my_df['travelTime'] = g['e'].transform('max').sub(g['s'].transform('min'))
print (my_df)
id date arriving_time leaving_time travelTime
0 a 2020/02/03 14:36:06 14:40:05 00:03:59
1 b 2020/04/05 08:52:02 08:52:41 00:05:12
2 b 2020/04/05 08:53:02 08:54:33 00:05:12
3 b 2020/04/05 08:55:24 08:57:14 00:05:12
4 b 2020/04/06 18:58:03 19:01:07 00:06:05
5 b 2020/04/06 19:03:05 19:04:08 00:06:05
6 c 2020/12/15 17:04:28 17:09:48 00:05:20
7 d 2020/06/23 21:31:23 21:50:12 00:18:49
IIUC we first groupby
id
& date
to get the max and min leave & arrival time.
then a simple subtraction.
df2 = df.groupby(['id','date']).agg(min_arrival=('arriving_time','min'),
max_leave=('leaving_time','max'))
df2['travelTime'] = pd.to_datetime(df2['max_leave']) - pd.to_datetime(df2['min_arrival'])
print(df2)
min_arrival max_leave travelTime
id date
a 2020-02-03 14:36:06 14:40:05 00:03:59
b 2020-04-05 08:52:02 08:57:14 00:05:12
2020-04-06 18:58:03 19:04:08 00:06:05
c 2020-12-15 17:04:28 17:09:48 00:05:20
d 2020-06-23 21:31:23 21:50:12 00:18:49
if you want this back on yout original df, you could use transform
or merge the values from the new delta onto your original :
df_new = (pd.merge(df,df2[['travelTime']],on=['date','id'],how='left')
id date arriving_time leaving_time travelTime
0 a 2020-02-03 14:36:06 14:40:05 00:03:59
1 b 2020-04-05 08:52:02 08:52:41 00:05:12
2 b 2020-04-05 08:53:02 08:54:33 00:05:12
3 b 2020-04-05 08:55:24 08:57:14 00:05:12
4 b 2020-04-06 18:58:03 19:01:07 00:06:05
5 b 2020-04-06 19:03:05 19:04:08 00:06:05
6 c 2020-12-15 17:04:28 17:09:48 00:05:20
7 d 2020-06-23 21:31:23 21:50:12 00:18:49
You could try this -
my_df['arriving_time'] = pd.to_datetime(my_df['arriving_time'])
my_df['leaving_time'] = pd.to_datetime(my_df['leaving_time'])
my_df['travel_time'] = my_df.groupby(['id', 'date'])['leaving_time'].transform('max') - my_df.groupby(['id', 'date'])['arriving_time'].transform('min')
my_df
id date arriving_time leaving_time travel_time
0 a 2020/02/03 2020-03-19 14:36:06 2020-03-19 14:40:05 00:03:59
1 b 2020/04/05 2020-03-19 08:52:02 2020-03-19 08:52:41 00:05:12
2 b 2020/04/05 2020-03-19 08:53:02 2020-03-19 08:54:33 00:05:12
3 b 2020/04/05 2020-03-19 08:55:24 2020-03-19 08:57:14 00:05:12
4 b 2020/04/06 2020-03-19 18:58:03 2020-03-19 19:01:07 00:06:05
5 b 2020/04/06 2020-03-19 19:03:05 2020-03-19 19:04:08 00:06:05
6 c 2020/12/15 2020-03-19 17:04:28 2020-03-19 17:09:48 00:05:20
7 d 2020/06/23 2020-03-19 21:31:23 2020-03-19 21:50:12 00:18:49
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With