What is the difference in processing between:
df=pd.read_csv(filename, parse_dates=[0], infer_datetime_format=True)
and
df=pd.read_csv(filename, parse_dates=[0])
Why is the first import supposed to be faster, given that parse_dates already specifies where to look for a date?
The docs for pandas.read_csv suggest why:
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Essentially, Pandas deduces the format of your datetime from the first element(s) and then assumes all other elements in the series will use the same format. This means Pandas does not need to check multiple formats when attempting to convert a string to datetime.
Remember, CSV files can only hold textual data, so a conversion to datetime (essentially a numeric type) will always be required.
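Here's a demonstration of the gap between a general-purpose parser, which has to work out the format of every string, and parsing with a known format:
from dateutil import parser
from datetime import datetime
# 20,000 date strings that all share the same format
L = ['2018-01-05', '2018-12-20', '2018-03-30', '2018-04-15']*5000
# %timeit is an IPython magic, so run these lines in IPython or Jupyter
%timeit [parser.parse(i) for i in L] # 1.57 s, format worked out per string
%timeit [datetime.strptime(i, '%Y-%m-%d') for i in L] # 338 ms, format given up front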
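The "infer once, then reuse" idea described above can be sketched with the standard library alone. This is only a simplified illustration, not pandas' actual implementation: the names `CANDIDATE_FORMATS`, `guess_format` and `parse_column` are made up for this example, and the candidate list is far shorter than what a real inference routine would handle.

```python
from datetime import datetime

# Hypothetical candidate list, for illustration only.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S"]

def guess_format(sample):
    """Return the first candidate format that parses `sample`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None

def parse_column(values):
    """Infer the format from the first value, then reuse it for every value."""
    if not values:
        return []
    fmt = guess_format(values[0])
    if fmt is None:
        # This sketch simply gives up; a real parser could fall back
        # to a slower, more general method instead.
        raise ValueError(f"could not infer a format from {values[0]!r}")
    # Fast path: one fixed format, no per-element guessing.
    return [datetime.strptime(v, fmt) for v in values]

dates = parse_column(['2018-01-05', '2018-12-20', '2018-03-30', '2018-04-15'])
```

The guessing cost is paid once, for the first element, and every remaining string goes through the cheap fixed-format path. As far as I know, infer_datetime_format is deprecated from pandas 2.0 onward because this kind of inference became the default behaviour, so on recent versions the two read_csv calls in the question should behave much the same.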