
Parse dates when YYYYMMDD and HH are in separate columns using pandas in Python

Tags: python, pandas

I have a simple question about CSV files and parsing datetimes.

I have a CSV file that looks like this:

YYYYMMDD, HH,    X
20110101,  1,   10
20110101,  2,   20
20110101,  3,   30

I would like to read it using pandas (read_csv) and have it in a dataframe indexed by the datetime. So far I've tried to implement the following:

import pandas as pnd
pnd.read_csv("..\\file.csv", parse_dates=True, index_col=[0, 1])

and the result I get is:

                         X
YYYYMMDD   HH
2011-01-01 2012-07-01   10
           2012-07-02   20
           2012-07-03   30

As you can see, parse_dates is converting the HH column into a different date.

Is there a simple and efficient way to properly combine the "YYYYMMDD" column with the "HH" column to get something like this?

                      X
Datetime
2011-01-01 01:00:00  10
2011-01-01 02:00:00  20
2011-01-01 03:00:00  30

Thanks in advance for the help.

Mauricio asked Jul 23 '12 15:07



2 Answers

If you pass a list to index_col, it means you want to create a hierarchical index out of the columns in the list.

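In addition, the parse_dates keyword can be set to either True or a list/dict. If True, pandas tries to parse the index as dates. If it is a flat list of column names or positions, each listed column is parsed as its own date column. If it is a list of lists (or a dict), the listed columns are combined and parsed as a single date column.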
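To make the hierarchical-index point concrete, here is a minimal sketch that reads the sample data from the question out of an in-memory string. The `csv_text` variable and the use of `skipinitialspace=True` are assumptions for illustration; the latter strips the blanks after each comma so that the headers come out as `HH` and `X`.

```python
import io

import pandas as pd

# Sample data copied from the question, held in a string instead of a file.
csv_text = """YYYYMMDD, HH,    X
20110101,  1,   10
20110101,  2,   20
20110101,  3,   30
"""

# Passing a list to index_col builds a two-level (hierarchical) index
# from the first two columns rather than one combined datetime index.
multi = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,  # assumption: drop the padding after each comma
    index_col=[0, 1],
)

print(multi.index.nlevels)
print(list(multi.index.names))
```

This shows why the attempt in the question cannot yield a single datetime index: the date and the hour stay as two separate index levels.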

In summary, what you want to do is:

from datetime import datetime
import pandas as pd

parse = lambda x: datetime.strptime(x, '%Y%m%d %H')
pd.read_csv("..\\file.csv",
            parse_dates=[['YYYYMMDD', 'HH']],
            index_col=0,
            date_parser=parse)

Chang She answered Sep 28 '22 21:09


I am doing this all the time, so I tested different ways for speed. The fastest I found is the following, approx. 3 times faster than Chang She's solution, at least in my case, when taking the total time of file parsing and date parsing into account:

First, read the data file with pd.read_csv without parsing dates; I find that date parsing slows the file reading down quite a lot. Make sure that the columns of the CSV file are ordinary columns in the dataframe df (not the index). Then:

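fmt = "%Y%m%d %H"
# read_csv reads both columns as integers by default, so cast them to str
# before concatenating; zero-pad the hour so it is always two digits
stamps = df.YYYYMMDD.astype(str) + ' ' + df.HH.astype(str).str.zfill(2)
times = pd.to_datetime(stamps, format=fmt)
df.set_index(times, inplace=True)
# and maybe for cleanup
df = df.drop(['YYYYMMDD', 'HH'], axis=1)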
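As a self-contained illustration of this read-first, convert-afterwards approach, the sketch below runs it on the sample data from the question. Reading from an in-memory string, `skipinitialspace=True`, the `dtype` mapping, the zero-padding of the hour, and the index name `Datetime` are all assumptions made for the example, not part of the original answer.

```python
import io

import pandas as pd

# Sample data copied from the question, held in a string instead of a file.
csv_text = """YYYYMMDD, HH,    X
20110101,  1,   10
20110101,  2,   20
20110101,  3,   30
"""

# Read without date parsing; keep the date and hour columns as strings
# so they can be concatenated directly.
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,               # assumption: strip padding after commas
    dtype={"YYYYMMDD": str, "HH": str},
)

# Build one "YYYYMMDD HH" string per row (hour zero-padded to two digits)
# and convert the whole column in a single vectorised call.
stamps = df["YYYYMMDD"] + " " + df["HH"].str.zfill(2)
times = pd.to_datetime(stamps, format="%Y%m%d %H")

# Use the timestamps as the index and drop the now-redundant columns.
df = df.set_index(times).drop(columns=["YYYYMMDD", "HH"])
df.index.name = "Datetime"

print(df)
```

The resulting frame has a single column X and a datetime index whose first entry is 2011-01-01 01:00:00, which matches the layout requested in the question.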

K.-Michael Aye answered Sep 28 '22 22:09