Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using python pandas to parse CSV with date in format Year, Day, Hour, Min, Sec

I have several CSV files with the format:

Year,Day,Hour,Min,Sec.,P1'S1
 2003,  1, 0, 0,12.22, 0.541
 2003,  1, 1, 0,20.69, 0.708
 2003,  1, 2, 0, 4.95, 0.520
 2003,  1, 3, 0,13.42, 0.539
...

(where day, is the day of the year) and I'm trying to read them using the pandas library (seems a fantastic lib so far).

There is a built-in function to read CSV in pandas, and even better, that function supposedly checks the columns for a date type. and automatically uses that as an index (which would be exactly perfect for what I'm doing).

The thing is, I cannot get it to work with date data in this format.

I tried:

data = pd.read_csv("csvFile.csv", index_col=[0, 1],  , index_col=[0, 1, 2, 3, 4] parse_dates=True)

but it only gets the year correctly:

In [36]: data.index
Out[36]: 
MultiIndex
[(<Timestamp: 2003-09-04 00:00:00>, 1, 0, 0, 12.22)
 (<Timestamp: 2003-09-04 00:00:00>, 1, 1, 0, 20.69)
 (<Timestamp: 2003-09-04 00:00:00>, 1, 2, 0, 4.95) ...,
 (<Timestamp: 2003-09-04 00:00:00>, 365, 21, 0, 3.77)
 (<Timestamp: 2003-09-04 00:00:00>, 365, 22, 0, 14.6)
 (<Timestamp: 2003-09-04 00:00:00>, 365, 23, 0, 13.36)]

From the documentation, I see that you can specify the "date_parser" attribute in the read_csv function of pandas. But the documentation doesn't show how and I'm not being able to figure it out. Anyone with experience in the subject that can give a hand.

Cheers, Bruno

like image 595
jbssm Avatar asked Sep 04 '12 18:09

jbssm


People also ask

How do you auto detect the date datetime columns and set their datatype when reading a csv file in pandas?

You should add parse_dates=True , or parse_dates=['column name'] when reading, thats usually enough to magically parse it.

How do I read a CSV file from datetime in Python?

To automatically read dates from a CSV file with Python Pandas, we can set the date_parser argument. to call read_csv with the file to read. And we set parse_dates to 'datetime' to parse dates with datetime .


1 Answers

In order to parse a multi-column date, you need to tell pandas which columns should be combined into a single date, so you need to say parse_dates=['Year','Day','Hour','Min','Sec']

You also need to define your own parser that takes a element from each column you specified in parse_dates:

In [1]: import pandas as pd

In [2]: from datetime import datetime, timedelta

In [3]: from cStringIO import StringIO

In [4]: data = """\
Year,Day,Hour,Min,Sec.,P1'S1
 2003,  1, 0, 0,12.22, 0.541
 2003,  1, 1, 0,20.69, 0.708
 2003,  1, 2, 0, 4.95, 0.520
 2003,  1, 3, 0,13.42, 0.539
"""

In [5]: def parse(yr, doy, hr, min, sec):
    yr, doy, hr, min = [int(x) for x in [yr, doy, hr, min]]
    sec = float(sec)
    mu_sec = int((sec - int(sec)) * 1e6)
    sec = int(sec)
    dt = datetime(yr - 1, 12, 31)
    delta = timedelta(days=doy, hours=hr, minutes=min, seconds=sec,
                      microseconds=mu_sec)
    return dt + delta
   ...: 

In [6]: pd.read_csv(StringIO(data), parse_dates={'datetime':      
           ['Year','Day','Hour','Min','Sec.']}, 
           date_parser=parse, index_col='datetime')
Out[6]: 
                            P1'S1
datetime                         
2003-01-01 00:00:12.220000  0.541
2003-01-01 01:00:20.690000  0.708
2003-01-01 02:00:04.950000  0.520
2003-01-01 03:00:13.419999  0.539
like image 143
Chang She Avatar answered Oct 07 '22 03:10

Chang She