Why use infer_datetime_format when importing csv file?

Question

What is the difference in processing between:

df=pd.read_csv(filename, parse_dates=[0], infer_datetime_format=True)

and

df=pd.read_csv(filename, parse_dates=[0])

Why is the first import faster, given that parse_dates already specifies where to look for a date?

jpp · Accepted Answer

The docs for pandas.read_csv suggest why:

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Essentially, pandas deduces the format of your datetime strings from the first element(s) and then assumes all other elements in the series use the same format. This means pandas does not need to try multiple candidate formats for every string it converts to datetime.
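The idea of "infer once, then reuse" can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how pandas implements it: the candidate list CANDIDATE_FORMATS and the helpers guess_format and parse_all are hypothetical names made up for this example.

```python
from datetime import datetime

# Hypothetical candidate list for illustration; pandas' real inference is more general.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d %H:%M:%S"]

def guess_format(sample):
    """Return the first candidate format that parses `sample`, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None

def parse_all(values):
    """Guess the format once from the first value, then reuse it for every value."""
    fmt = guess_format(values[0])
    if fmt is None:
        raise ValueError("could not infer a format from the first value")
    # No per-element guessing here: one known format for the whole list.
    return [datetime.strptime(v, fmt) for v in values]

dates = parse_all(["2018-01-05", "2018-12-20", "2018-03-30"])
```

The guessing cost is paid once for the first value rather than once per row, which is where the speed-up comes from.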

Remember, CSV files can only hold textual data, so a conversion to datetime (essentially a numeric type) will always be required.
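To see that a CSV only carries text, here is a small sketch using the standard library's csv module instead of pandas. The in-memory file and its column names are made up for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical in-memory file standing in for `filename`.
raw = io.StringIO("date,value\n2018-01-05,1\n2018-12-20,2\n")

rows = list(csv.reader(raw))
header, data = rows[0], rows[1:]

# Every field comes back as str, including the date column.
first_date_text = data[0][0]

# Turning the text into datetime objects is an explicit, separate step.
parsed = [datetime.strptime(row[0], "%Y-%m-%d") for row in data]
```

Whatever reads the file, this conversion step has to happen somewhere; parse_dates asks pandas to do it during the import.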

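Here's a demonstration comparing a general-purpose parser, which has to work out the format of every string, with parsing against a fixed format:

from dateutil import parser
from datetime import datetime

# 20,000 date strings, all in the same YYYY-MM-DD format
L = ['2018-01-05', '2018-12-20', '2018-03-30', '2018-04-15']*5000

# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit [parser.parse(i) for i in L]                   # 1.57 s  (format worked out per string)
%timeit [datetime.strptime(i, '%Y-%m-%d') for i in L]  # 338 ms  (format known up front)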
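If you already know the format, you can skip inference and pass it explicitly. A minimal sketch, assuming a made-up in-memory CSV with a column named date. Note also that, as far as I know, pandas 2.0 and later deprecate infer_datetime_format and apply a stricter form of this inference by default, so on recent versions the flag is no longer needed.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for `filename`.
csv_text = "date,value\n2018-01-05,1\n2018-12-20,2\n2018-03-30,3\n"

# Read the date column as plain strings first...
df = pd.read_csv(io.StringIO(csv_text))

# ...then convert it with an explicit format, so nothing has to be guessed.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```

This is the pandas analogue of the strptime line above: one known format applied to the whole column.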
