
Pandas read_csv: parsing time field correctly

Tags: python, pandas

I have the following raw data,

TranID,TranDate,TranTime,TranAmt
A123456,20160427,02:18,9999.53
B123457,20160426,02:48,26070.33
C123458,20160425,03:18,13779.56
A123459,20160424,03:18,18157.26
B123460,20160423,04:18,215868.15
C123461,20160422,04:18,23695.25
A123462,20160421,05:18,57
B123463,20160420,05:18,64594.24
C123464,20160419,06:18,47890.91
A123465,20160427,06:18,14119.74
B123466,20160426,07:18,2649.6
C123467,20160425,07:18,16757.38
A123468,20160424,08:18,8864.78
B123469,20160423,08:18,26254.69
C123470,20160422,09:18,13206.98
A123471,20160421,09:18,15872.45
B123472,20160420,10:18,197621.18
C123473,20160419,10:18,21048.72

and I tried importing it with pd.read_csv:

Try1

import numpy as np
import pandas as pd

df = pd.read_csv('MyTest.csv', sep=',', header=0, parse_dates=['TranDate'],
                 usecols=['TranID','TranDate','TranTime','TranAmt'],
                 engine='python')
print(df.dtypes)
df[:5]

Output1

TranID              object
TranDate    datetime64[ns]
TranTime            object
TranAmt            float64
dtype: object
Out[12]:
    TranID   TranDate TranTime    TranAmt
0  A123456 2016-04-27    02:18    9999.53
1  B123457 2016-04-26    02:48   26070.33
2  C123458 2016-04-25    03:18   13779.56
3  A123459 2016-04-24    03:18   18157.26
4  B123460 2016-04-23    04:18  215868.15

Try2

import numpy as np
import pandas as pd

df = pd.read_csv('MyTest.csv', sep=',', header=0, parse_dates=['TranDate', 'TranTime'],
                 usecols=['TranID','TranDate','TranTime','TranAmt'],
                 engine='python')
print(df.dtypes)
df[:5]

Output2

TranID              object
TranDate    datetime64[ns]
TranTime    datetime64[ns]
TranAmt            float64
dtype: object
Out[13]:
    TranID   TranDate            TranTime    TranAmt
0  A123456 2016-04-27 2016-04-27 02:18:00    9999.53
1  B123457 2016-04-26 2016-04-27 02:48:00   26070.33
2  C123458 2016-04-25 2016-04-27 03:18:00   13779.56
3  A123459 2016-04-24 2016-04-27 03:18:00   18157.26
4  B123460 2016-04-23 2016-04-27 04:18:00  215868.15

My confusion is with the TranTime column. In Try1 it is displayed correctly, but its dtype is object. In Try2, pandas added the current date to the time, and the dtype is datetime64[ns].

I want the TranTime column to be treated as a time, and I want to aggregate on it using pandas' groupby or pivot_table. If I use the Try1 method, does the object dtype affect my aggregations? If I use the Try2 method, do I need to strip out the date part in order to use the time part?

I am proficient in SAS, which has date, time and datetime informats and formats, while the underlying data type is just numeric. Hence my confusion with Python's object and datetime dtypes.

Thanks, Lobbie

Lobbie asked Mar 12 '23 11:03


2 Answers

In plain Python, datetimes are generally represented as datetime.datetime objects. Storing these in a column is not very efficient, which is why pandas uses its own Timestamp type (dtype datetime64[ns]), which is stored numerically as a 64-bit integer.
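To make the "Timestamps are numeric" point concrete, here is a minimal sketch that exposes the integers behind a datetime column. The variable names are only illustrative, and the explicit cast to datetime64[ns] is there because the default resolution may differ between pandas versions.

```python
import pandas as pd

# Two of the timestamps from the question, parsed into a datetime Series.
stamps = pd.Series(pd.to_datetime(["2016-04-27 02:18", "2016-04-26 02:48"]))

# Force nanosecond resolution so the integers below are nanoseconds
# since 1970-01-01 regardless of the pandas version's default.
stamps_ns = stamps.astype("datetime64[ns]")

# The underlying storage: one 64-bit integer per timestamp.
as_int = stamps_ns.astype("int64")
print(as_int)

# Converting the integers back gives the original timestamps.
round_trip = pd.to_datetime(as_int, unit="ns")
print(round_trip)
```

This is similar in spirit to the numeric storage the question describes for SAS: the dtype decides how the number is interpreted and displayed.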

To read the data (note the double brackets around the parse_dates arguments):

df = pd.read_csv(filename, parse_dates=[['TranDate', 'TranTime']])

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18 entries, 0 to 17
Data columns (total 3 columns):
TranDate_TranTime    18 non-null datetime64[ns]
TranID               18 non-null object
TranAmt              18 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)

>>> df.head()
    TranDate_TranTime   TranID    TranAmt
0 2016-04-27 02:18:00  A123456    9999.53
1 2016-04-26 02:48:00  B123457   26070.33
2 2016-04-25 03:18:00  C123458   13779.56
3 2016-04-24 03:18:00  A123459   18157.26
4 2016-04-23 04:18:00  B123460  215868.15

The date and time columns have been joined into a single column. Once you have this timestamp, it is easy to access its attributes using the dt accessor, e.g.

>>> df.groupby(df.TranDate_TranTime.dt.hour).TranAmt.sum().head()
TranDate_TranTime
2     36069.86
3     31936.82
4    239563.40
5     64651.24
6     62010.65
Name: TranAmt, dtype: float64

>>> df.groupby(df.TranDate_TranTime.dt.day).TranAmt.sum().head()
TranDate_TranTime
19     68939.63
20    262215.42
21     15929.45
22     36902.23
23    242122.84
Name: TranAmt, dtype: float64

Refer to the Pandas docs for more information regarding Pandas date functionality.
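As far as I know, the nested-list form of parse_dates has been deprecated in recent pandas releases, so on a newer version the same combined column may need to be built by hand. A sketch under that assumption: read both fields as strings and join them with pd.to_datetime. The column name TranDateTime and the inline sample (the first five rows of the question's data) are my own choices for illustration.

```python
import io
import pandas as pd

# First five rows of the question's data, inlined so the sketch is self-contained.
csv_text = """TranID,TranDate,TranTime,TranAmt
A123456,20160427,02:18,9999.53
B123457,20160426,02:48,26070.33
C123458,20160425,03:18,13779.56
A123459,20160424,03:18,18157.26
B123460,20160423,04:18,215868.15
"""

# Read the date and time fields as strings so they can be concatenated.
df = pd.read_csv(io.StringIO(csv_text), dtype={"TranDate": str, "TranTime": str})

# Build one timestamp column from the two text columns.
df["TranDateTime"] = pd.to_datetime(
    df["TranDate"] + " " + df["TranTime"], format="%Y%m%d %H:%M"
)
df = df.drop(columns=["TranDate", "TranTime"])

# Same style of aggregation as above: total amount per hour of day.
hourly = df.groupby(df["TranDateTime"].dt.hour)["TranAmt"].sum()
print(df.dtypes)
print(hourly)
```

Passing an explicit format avoids any guessing about how 20160427 should be read.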

Alexander answered Mar 30 '23 10:03


  1. The object dtype does not affect aggregation, but the time values are then plain strings rather than times.
  2. No, in most cases you can get at the time part through the .dt accessor.
import pandas as pd

df = pd.read_csv('MyTest.csv', parse_dates=[['TranDate', 'TranTime']])
print(df)

TranDate_TranTime   TranID    TranAmt
0  2016-04-27 02:18:00  A123456    9999.53
1  2016-04-26 02:48:00  B123457   26070.33
2  2016-04-25 03:18:00  C123458   13779.56
3  2016-04-24 03:18:00  A123459   18157.26
4  2016-04-23 04:18:00  B123460  215868.15
5  2016-04-22 04:18:00  C123461   23695.25
6  2016-04-21 05:18:00  A123462      57.00
7  2016-04-20 05:18:00  B123463   64594.24
8  2016-04-19 06:18:00  C123464   47890.91
9  2016-04-27 06:18:00  A123465   14119.74
10 2016-04-26 07:18:00  B123466    2649.60
11 2016-04-25 07:18:00  C123467   16757.38
12 2016-04-24 08:18:00  A123468    8864.78
13 2016-04-23 08:18:00  B123469   26254.69
14 2016-04-22 09:18:00  C123470   13206.98
15 2016-04-21 09:18:00  A123471   15872.45
16 2016-04-20 10:18:00  B123472  197621.18
17 2016-04-19 10:18:00  C123473   21048.72

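As far as possible, parse and manage the date and time as a single column, using the nested-bracket form parse_dates=[[...]].

# Select TranAmt explicitly so only the numeric column is summed.
print(df.groupby(df.TranDate_TranTime.dt.hour)[['TranAmt']].sum())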

      TranAmt
2    36069.86
3    31936.82
4   239563.40
5    64651.24
6    62010.65
7    19406.98
8    35119.47
9    29079.43
10  218669.90

print(df.groupby(df.TranDate_TranTime.dt.minute)[['TranAmt']].sum())

      TranAmt
18  710437.42
48   26070.33

Group by whichever component of the timestamp you need, as above.
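If you want to keep TranTime as a time of day on its own (the Try1 situation from the question), a rough sketch of three options follows. The TimeOfDay column name and the four-row sample are only for illustration.

```python
import io
import pandas as pd

# Four rows from the question's data, inlined for a self-contained example.
csv_text = """TranID,TranDate,TranTime,TranAmt
A123456,20160427,02:18,9999.53
B123457,20160426,02:48,26070.33
C123458,20160425,03:18,13779.56
A123459,20160424,03:18,18157.26
"""
df = pd.read_csv(io.StringIO(csv_text), dtype={"TranDate": str, "TranTime": str})

# Option 1: group directly on the text column; equal strings form one group.
by_string = df.groupby("TranTime")["TranAmt"].sum()

# Option 2: convert "HH:MM" to a timedelta (a duration since midnight),
# which is numeric underneath and supports the .dt accessor.
df["TimeOfDay"] = pd.to_timedelta(df["TranTime"] + ":00")
by_hour = df.groupby(df["TimeOfDay"].dt.components["hours"])["TranAmt"].sum()

# Option 3: parse to datetime, then keep only datetime.time objects.
# The resulting column has dtype object again.
clock_times = pd.to_datetime(df["TranTime"], format="%H:%M").dt.time

print(by_string)
print(by_hour)
print(clock_times)
```

Option 2 is probably the closest match to a numeric time value; option 3 displays nicely but gives up the vectorised .dt methods.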

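You can also group again after resampling, like below.

# Select TranAmt so the text column TranID is not summed; min_count=1 leaves
# empty 60-second bins as NaN so that dropna() can remove them.
df2 = df.set_index('TranDate_TranTime')[['TranAmt']].resample('60s').sum(min_count=1).dropna()
print(df2)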

                       TranAmt
TranDate_TranTime             
2016-04-19 06:18:00   47890.91
2016-04-19 10:18:00   21048.72
2016-04-20 05:18:00   64594.24
2016-04-20 10:18:00  197621.18
2016-04-21 05:18:00      57.00
2016-04-21 09:18:00   15872.45
2016-04-22 04:18:00   23695.25
2016-04-22 09:18:00   13206.98
2016-04-23 04:18:00  215868.15
2016-04-23 08:18:00   26254.69
2016-04-24 03:18:00   18157.26
2016-04-24 08:18:00    8864.78
2016-04-25 03:18:00   13779.56
2016-04-25 07:18:00   16757.38
2016-04-26 02:48:00   26070.33
2016-04-26 07:18:00    2649.60
2016-04-27 02:18:00    9999.53
2016-04-27 06:18:00   14119.74

print(df2.groupby(df2.index.day).sum())

      TranAmt
19   68939.63
20  262215.42
21   15929.45
22   36902.23
23  242122.84
24   27022.04
25   30536.94
26   28719.93
27   24119.27
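Since the question also mentions pivot_table, here is a hedged sketch of two related shortcuts: resampling straight to daily totals, and a day-by-hour pivot. The helper columns Stamp, Day and Hour are names I made up, and the sample is four rows taken from the question's data.

```python
import io
import pandas as pd

# Four rows from the question's data (two per day), inlined for the example.
csv_text = """TranID,TranDate,TranTime,TranAmt
A123456,20160427,02:18,9999.53
B123457,20160426,02:48,26070.33
A123465,20160427,06:18,14119.74
B123466,20160426,07:18,2649.6
"""
df = pd.read_csv(io.StringIO(csv_text), dtype={"TranDate": str, "TranTime": str})
df["Stamp"] = pd.to_datetime(
    df["TranDate"] + " " + df["TranTime"], format="%Y%m%d %H:%M"
)

# Daily totals in one step: resample the amounts to calendar days.
daily = df.set_index("Stamp")["TranAmt"].resample("D").sum()

# Day-of-month by hour-of-day table of summed amounts.
df["Day"] = df["Stamp"].dt.day
df["Hour"] = df["Stamp"].dt.hour
table = df.pivot_table(index="Day", columns="Hour", values="TranAmt", aggfunc="sum")

print(daily)
print(table)
```

Combinations of day and hour with no transactions show up as NaN in the pivot.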

su79eu7k answered Mar 30 '23 09:03