I have the following file (df_SOF1.csv); it is 1 million records long:
Location,Transport,Transport1,DateOccurred,CostCentre,D_Time,count
0,Lorry,Car,07/09/2012,0,0:00:00,2
1,Lorry,Car,11/09/2012,0,0:00:00,5
2,Lorry,Car,14/09/2012,0,0:00:00,30
3,Lorry,Car,14/09/2012,0,0:07:00,2
4,Lorry,Car,14/09/2012,0,0:29:00,1
5,Lorry,Car,14/09/2012,0,3:27:00,3
6,Lorry,Car,14/09/2012,0,3:28:00,4
7,Lorry,Car,21/09/2012,0,0:00:00,13
8,Lorry,Car,27/09/2012,0,0:00:00,8
9,Lorry,Car,28/09/2012,0,0:02:00,1
10,Train,Bus,03/09/2012,2073,7:49:00,1
11,Train,Bus,05/09/2012,2073,7:50:00,1
12,Train,Bus,06/09/2012,2073,7:52:00,1
13,Train,Bus,07/09/2012,2073,7:48:00,1
14,Train,Bus,08/09/2012,2073,7:55:00,1
15,Train,Bus,11/09/2012,2073,7:49:00,1
16,Train,Bus,12/09/2012,2073,7:52:00,1
17,Train,Bus,13/09/2012,2073,7:50:00,1
18,Train,Bus,14/09/2012,2073,7:54:00,1
19,Train,Bus,18/09/2012,2073,7:51:00,1
20,Train,Bus,19/09/2012,2073,7:50:00,1
21,Train,Bus,20/09/2012,2073,7:51:00,1
22,Train,Bus,21/09/2012,2073,7:52:00,1
23,Train,Bus,22/09/2012,2073,7:53:00,1
24,Train,Bus,23/09/2012,2073,7:49:00,1
25,Train,Bus,24/09/2012,2073,7:54:00,1
26,Train,Bus,25/09/2012,2073,7:55:00,1
27,Train,Bus,26/09/2012,2073,7:53:00,1
28,Train,Bus,27/09/2012,2073,7:55:00,1
29,Train,Bus,28/09/2012,2073,7:53:00,1
30,Train,Bus,29/09/2012,2073,7:56:00,1
I am using pandas to analyse it. I have been trying for at least 40 hours
to find a way to group the data so that I can aggregate the time column D_Time.
I have loaded the required modules and I create a DataFrame (see below) using DateOccurred
as the index:
import pandas as pd

df_SOF1 = pd.read_csv('/users/fabulous/documents/df_SOF1.csv', index_col=3, parse_dates=True)  # read the file from disk
I can group by any column or iterate through any row e.g.
df_SOF1.groupby('Location').sum()
However, I have not found a way to sum or take the mean of the D_Time
column using pandas. I have read over 20 articles on timedeltas etc. but am still none the wiser about how to do this in pandas.
Any solution that allows me to do arithmetic on the D_Time
column would be appreciated (even if it has to be done outside of pandas).
I thought one possible solution would be to convert the D_Time
column into seconds.
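For what it's worth, under recent pandas versions that conversion can be done directly with pd.to_timedelta. A minimal sketch on a tiny hypothetical frame (not the full file), assuming the same column names as the CSV:

```python
import pandas as pd

# Hypothetical miniature of the CSV, keeping only the columns we need.
df = pd.DataFrame({
    "Transport": ["Lorry", "Lorry", "Train"],
    "D_Time": ["0:07:00", "3:27:00", "7:49:00"],
})

# Parse the H:MM:SS strings into timedeltas, then express them in seconds.
df["D_Time"] = pd.to_timedelta(df["D_Time"]).dt.total_seconds()

# Numeric aggregation now works as expected.
print(df.groupby("Transport")["D_Time"].sum())
```

With a numeric D_Time column, both sum() and mean() behave normally in groupby aggregations.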
Edit (2012/11/01):
I ran the following command on the 30 items above
df_SOF1.groupby('Transport').agg({'D_Time': sum})
D_Time
Transport
Lorry 0:00:000:00:000:00:000:07:000:29:003:27:003:28...
Train 7:49:007:50:007:52:007:48:007:55:007:49:007:52..
It seems to concatenate the values (like adding strings) rather than give a numerical sum.
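That is exactly what happens: D_Time is stored as strings (object dtype), and summing an object Series applies + pairwise, which for strings is concatenation. A small illustration:

```python
import pandas as pd

# With object (string) dtype, Series.sum() reduces with +, i.e. concatenation.
times = pd.Series(["0:07:00", "0:29:00"], dtype=object)
print(times.sum())  # concatenated string, not an arithmetic sum
```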
Cheers
I didn't find any mention of timedelta support for this in pandas, but the datetime module has timedelta, so converting D_Time to seconds is not a bad idea:
import datetime

def seconds(time_str):
    # Parse the H:MM:SS string and subtract the zero time to get a timedelta.
    end_time = datetime.datetime.strptime(time_str, '%H:%M:%S')
    delta = end_time - datetime.datetime.strptime('0:0:0', '%H:%M:%S')
    return delta.total_seconds()

df_SOF1.D_Time = df_SOF1.D_Time.apply(seconds)
Result:
>>> df_SOF1.groupby('CostCentre').sum()
Location D_Time count
CostCentre
0 45 27180 69
2073 420 594660 21
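If you want to display the summed seconds as a readable duration again, datetime.timedelta can format them; a sketch using the Lorry total from the grouped result above:

```python
import datetime

# 27180 is the summed D_Time (in seconds) for CostCentre 0 above.
total_seconds = 27180
print(str(datetime.timedelta(seconds=total_seconds)))  # → 7:33:00
```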
Moving datetime.datetime.strptime('0:0:0','%H:%M:%S') out of the function into the global namespace (as a variant named sec) can reduce execution time:
>>> timeit.timeit("sec('01:01:01')", setup="from __main__ import sec", number=10000)
1.025843858718872
>>> timeit.timeit("seconds('01:01:01')", setup="from __main__ import seconds", number=10000)
0.6128969192504883
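The sec variant is not shown above; a sketch of what it might look like, with the zero time computed once at import rather than on every call:

```python
import datetime

# Computed once, instead of re-parsing '0:0:0' inside every call.
ZERO = datetime.datetime.strptime('0:0:0', '%H:%M:%S')

def sec(time_str):
    """Convert an H:MM:SS string to total seconds."""
    return (datetime.datetime.strptime(time_str, '%H:%M:%S') - ZERO).total_seconds()
```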