
Why use infer_datetime_format when importing csv file?

What is the difference in processing between:

df=pd.read_csv(filename, parse_dates=[0], infer_datetime_format=True)

and

df=pd.read_csv(filename, parse_dates=[0])

Why is the first import supposed to be faster, given that parse_dates already specifies where to look for a date?

rul30 asked May 17 '26 19:05


1 Answer

The docs for pandas.read_csv suggest why:

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Essentially, Pandas deduces the format of your datetime from the first element(s) and then assumes all other elements in the series will use the same format. This means Pandas does not need to check multiple formats when attempting to convert a string to datetime.
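
To make the idea concrete, here is a minimal standard-library sketch of "guess the format once, then reuse it". It is only an illustration of the principle and not pandas' actual implementation: the candidate list CANDIDATE_FORMATS and the helpers guess_format and parse_all are hypothetical names invented for this example.

```python
from datetime import datetime

# Hypothetical candidate list for illustration; pandas' real inference
# is more general than trying a fixed set of formats.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]


def guess_format(sample):
    """Return the first candidate format that parses `sample`, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_all(values):
    """Infer the format from the first value, then reuse it for every value."""
    if not values:
        return []
    fmt = guess_format(values[0])
    if fmt is None:
        raise ValueError("could not infer a format from the first value")
    # The expensive guessing happened once; each remaining value is parsed
    # with a single strptime call using the known format.
    return [datetime.strptime(v, fmt) for v in values]


dates = parse_all(["2018-01-05", "2018-12-20", "2018-03-30"])
```

The trade-off is the one described above: if a later value does not follow the format of the first one, this sketch raises an error instead of trying other formats.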

Remember, CSV files can only hold textual data, so a conversion to datetime (essentially a numeric type) will always be required.

Here's a demonstration:

from dateutil import parser
from datetime import datetime

L = ['2018-01-05', '2018-12-20', '2018-03-30', '2018-04-15']*5000

%timeit [parser.parse(i) for i in L]                   # 1.57 s
%timeit [datetime.strptime(i, '%Y-%m-%d') for i in L]  # 338 ms
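
If the format is known in advance, another option is to skip inference and pass it explicitly. The following is a small sketch, assuming pandas is installed; the in-memory CSV text and the column names date and value are made up for illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "date,value\n2018-01-05,1\n2018-12-20,2\n2018-03-30,3\n"

# Option 1: let read_csv parse the first column as dates.
df_auto = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])

# Option 2: read the column as text, then convert it with a known format,
# so no format guessing is needed at all.
df_explicit = pd.read_csv(io.StringIO(csv_text))
df_explicit["date"] = pd.to_datetime(df_explicit["date"], format="%Y-%m-%d")

# Both routes should yield the same timestamps.
same_values = bool((df_auto["date"] == df_explicit["date"]).all())
```

Either way the column ends up with a datetime dtype; the difference is only in how much work pandas does to figure out the format.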

jpp answered May 20 '26 09:05