Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is pandas.to_datetime slow for non standard time format such as '2014/12/31'

I have a .csv file in such format

timestmp, p 2014/12/31 00:31:01:9200, 0.7 2014/12/31 00:31:12:1700, 1.9 ... 

and when read via pd.read_csv and convert the time str to datetime using pd.to_datetime, the performance drops dramatically. Here is a minimal example.

import re import pandas as pd  d = '2014-12-12 01:02:03.0030' c = re.sub('-', '/', d)  %timeit pd.to_datetime(d) %timeit pd.to_datetime(c) %timeit pd.to_datetime(c, format="%Y/%m/%d %H:%M:%S.%f") 

and the performances are:

10000 loops, best of 3: 62.4 µs per loop 10000 loops, best of 3: 181 µs per loop 10000 loops, best of 3: 82.9 µs per loop 

so, how could I improve the performance of pd.to_datetime when reading date from a csv file?

like image 211
liubenyuan Avatar asked Aug 16 '15 11:08

liubenyuan


People also ask

How do pandas deal with date time?

Pandas has a built-in function called to_datetime()that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.

What is pandas datetime format?

By default pandas datetime format is YYYY-MM-DD ( %Y-%m-%d ).

What is UTC true in pandas?

If True , the function always returns a timezone-aware UTC-localized Timestamp , Series or DatetimeIndex . To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC. If False (default), inputs will not be coerced to UTC.


1 Answers

This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).

As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True


Apparently, the infer_datetime_format cannot infer when there are microseconds. With an example without those, you can see a large speed-up:

In [28]: d = '2014-12-24 01:02:03'  In [29]: c = re.sub('-', '/', d)  In [30]: s_c = pd.Series([c]*10000)  In [31]: %timeit pd.to_datetime(s_c) 1 loops, best of 3: 1.14 s per loop  In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True) 10 loops, best of 3: 105 ms per loop  In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S") 10 loops, best of 3: 99.5 ms per loop 
like image 176
joris Avatar answered Oct 15 '22 09:10

joris