I have a .csv file in such format
timestmp, p 2014/12/31 00:31:01:9200, 0.7 2014/12/31 00:31:12:1700, 1.9 ...
and when read via pd.read_csv
and convert the time str to datetime using pd.to_datetime
, the performance drops dramatically. Here is a minimal example.
import re import pandas as pd d = '2014-12-12 01:02:03.0030' c = re.sub('-', '/', d) %timeit pd.to_datetime(d) %timeit pd.to_datetime(c) %timeit pd.to_datetime(c, format="%Y/%m/%d %H:%M:%S.%f")
and the performances are:
10000 loops, best of 3: 62.4 µs per loop 10000 loops, best of 3: 181 µs per loop 10000 loops, best of 3: 82.9 µs per loop
so, how could I improve the performance of pd.to_datetime
when reading date from a csv file?
Pandas has a built-in function called to_datetime()that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.
By default pandas datetime format is YYYY-MM-DD ( %Y-%m-%d ).
If True , the function always returns a timezone-aware UTC-localized Timestamp , Series or DatetimeIndex . To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC. If False (default), inputs will not be coerced to UTC.
This is because pandas falls back to dateutil.parser.parse
for parsing the strings when it has a non-default format or when no format
string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format
string to to_datetime
. Or another option is to use infer_datetime_format=True
Apparently, the infer_datetime_format
cannot infer when there are microseconds. With an example without those, you can see a large speed-up:
In [28]: d = '2014-12-24 01:02:03' In [29]: c = re.sub('-', '/', d) In [30]: s_c = pd.Series([c]*10000) In [31]: %timeit pd.to_datetime(s_c) 1 loops, best of 3: 1.14 s per loop In [32]: %timeit pd.to_datetime(s_c, infer_datetime_format=True) 10 loops, best of 3: 105 ms per loop In [33]: %timeit pd.to_datetime(s_c, format="%Y/%m/%d %H:%M:%S") 10 loops, best of 3: 99.5 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With