Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inferring date format versus passing a parser

Pandas internals question: I've been surprised to find a few times that explicitly passing a callable to date_parser within pandas.read_csv results in much slower read time than simply using infer_datetime_format=True.

Why is this? Will timing differences between these two options be date-format-specific, or what other factors will influence their relative timing?

In the below case, infer_datetime_format=True takes one-tenth the time of passing a date parser with a specified format. I would have naively assumed the latter would be faster because it's explicit.

The docs do note,

[if True,] pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

but there's not much detail given and I was unable to work my way fully through the source.

Setup:

from io import StringIO

import numpy as np
import pandas as pd

np.random.seed(444)
dates = pd.date_range('1980', '2018')
df = pd.DataFrame(np.random.randint(0, 100, (len(dates), 2)),
                  index=dates).add_prefix('col').reset_index()

# Something reproducible to be read back in
buf = StringIO()
df.to_string(buf=buf, index=False)

def read_test(**kwargs):
    # Not ideal for .seek() to eat up runtime, but alleviate
    # this with more loops than needed in timing below
    buf.seek(0)
    return pd.read_csv(buf, sep='\s+', parse_dates=['index'], **kwargs)

# dateutil.parser.parser called in this case, according to docs
%timeit -r 7 -n 100 read_test()
18.1 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -r 7 -n 100 read_test(infer_datetime_format=True)
19.8 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Doesn't change with native Python datetime.strptime either
%timeit -r 7 -n 100 read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
187 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I'm interested in knowing a bit about what is going on internally with infer to give it this advantage. My old understanding was that there was already some type of inference going on in the first place because dateutil.parser.parser is used if neither is passed.


Update: did some digging on this but haven't been able to answer the question.

read_csv() calls a helper function which in turn calls pd.core.tools.datetimes.to_datetime(). That function (accessible as just pd.to_datetime()) has both an infer_datetime_format and a format argument.

However, in this case, the relative timings are very different and don't reflect the above:

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000']*1000)

%timeit pd.to_datetime(s,infer_datetime_format=True)
19.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.to_datetime(s,infer_datetime_format=False)
1.01 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# This was taking the longest with i/o functions,
# now it's behaving "as expected"
%timeit pd.to_datetime(s,format='%m/%d/%Y')
19 ms ± 373 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 473
Brad Solomon Avatar asked Feb 11 '18 04:02

Brad Solomon


People also ask

What does it mean to parse dates?

Parsing dates allows a user to format a Date input from a Form to the local standard format. The Parse Date with Format step converts a String into a DateTime data format. The Get Formatted Date step will convert a DateTime data input into a String.

How does Python determine date format?

DateTime Format Codes Here are some of the most common: %Y — This is used to represent the Year and it ranges from 0001 to 9999. %m — This is used to represent the month of a year and it ranges from 01 to 12. %d — This is used to represent the days of the month and ranges from 01 to 31.

What does parse date do in pandas?

By default, date columns are represented as object when loading data from a CSV file. To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

What is this datetime format?

The DATETIME type is used for values that contain both date and time parts. MySQL retrieves and displays DATETIME values in ' YYYY-MM-DD hh:mm:ss ' format. The supported range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59' . The TIMESTAMP data type is used for values that contain both date and time parts.


2 Answers

You've identified the two important functions: read_csv prepares a function to parse the date columns using _make_date_converter, and this is always going make a call to_datetime (pandas' primary string-to-date conversion tool).

The answers by @WillAyd and @bmbigbang both seem correct to me in that they identify the cause of the slowness as repeated calls of the lambda function.

Since you ask for more details about pandas' source code, I'll try and examine each read_test call in more detail below to find out where we end up in to_datetime and ultimately why the timings are as you observed for your data.


read_test()

This is very fast because, without any prompts about a possible date format, pandas is going to try and parse the list-like column of strings as though they're approximately in ISO8601 format (which is a very common case).

Plunging into to_datetime, we quickly reach this code branch:

if result is None and (format is None or infer_datetime_format):
    result = tslib.array_to_datetime(...)

From here on, it's compiled Cython code all the way.

array_to_datetime iterates through the column of strings to convert each one to datetime format. For each row, we hit _string_to_dts at this line; then we go to another short snippet of inlined code (_cstring_to_dts) which means parse_iso_8601_datetime is called to do the actual parsing of the string to a datetime object.

This function is more than capable of parsing dates in the YYYY-MM-DD format and so there is just some housekeeping to finish the job (the C struct filled by parse_iso_8601_datetime becomes a proper datetime object, some bounds are checked).

As you can see, dateutil.parser.parser is not called at all.


read_test(infer_datetime_format=True)

Let's see why this is almost as fast as read_test().

Asking pandas to infer the datetime format (and passing no format argument) means we land here in to_datetime:

if infer_datetime_format and format is None:
    format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)

This calls _guess_datetime_format_for_array, which takes the first non-null value in the column and gives it to _guess_datetime_format. This tries to build a datetime format string to use for future parsing. (My answer here has more detail above the formats it is able recognise.)

Fortunately, the YYYY-MM-DD format is one that can be recognised by this function. Even more fortunately, this particular format has a fast-path through the pandas code!

You can see pandas sets infer_datetime_format back to False here:

if format is not None:
    # There is a special fast-path for iso8601 formatted
    # datetime strings, so in those cases don't use the inferred
    # format because this path makes process slower in this
    # special case
    format_is_iso8601 = _format_is_iso(format)
    if format_is_iso8601:
        require_iso8601 = not infer_datetime_format
        format = None

This allows the code to take the same path as above to the parse_iso_8601_datetime function.


read_test(date_parser=lambda dt: strptime(dt, '%Y-%m-%d'))

We've provided a function to parse the date with, so pandas executes this code block.

However, this raises as exception internally:

strptime() argument 1 must be str, not numpy.ndarray

This exception is immediately caught, and pandas falls back to using try_parse_dates before calling to_datetime.

try_parse_dates means that instead of being called on an array, the lambda function is called repeatedly for each value of the array in this loop:

for i from 0 <= i < n:
    if values[i] == '':
        result[i] = np.nan
    else:
        result[i] = parse_date(values[i]) # parse_date is the lambda function

Despite being compiled code, we pay the penalty of having function calls to Python code. This makes it very slow in comparison to the other approaches above.

Back in to_datetime, we now have an object array filled with datetime objects. Again we hit array_to_datetime, but this time pandas sees a date object and uses another function (pydate_to_dt64) to make it into a datetime64 object.

The cause of the slowdown is really due to the repeated calls to the lambda function.


About your update and the MM/DD/YYYY format

The Series s has date strings in the MM/DD/YYYY format.

This is not an ISO8601 format. pd.to_datetime(s, infer_datetime_format=False) tries to parse the string using parse_iso_8601_datetime but this fails with a ValueError. The error is handled here: pandas is going to use parse_datetime_string instead. This means that dateutil.parser.parse is used to convert the string to datetime. This is why it is slow in this case: repeated use of a Python function in a loop.

There's not much difference between pd.to_datetime(s, format='%m/%d/%Y') and pd.to_datetime(s, infer_datetime_format=True) in terms of speed. The latter uses _guess_datetime_format_for_array again to infer the MM/DD/YYYY format. Both then hit array_strptime here:

if format is not None:
    ...
    if result is None:
        try:
            result = array_strptime(arg, format, exact=exact, errors=errors)

array_strptime is a fast Cython function for parsing an array of strings to datetime structs given a specific format.

like image 82
Alex Riley Avatar answered Nov 04 '22 12:11

Alex Riley


infer_datetime_format applies the format it deduces to all elements in a single pass (i.e. in a vectorized manner). By contrast, your lambda function gets called for each element, bringing with it a lot more call overhead and decreased performance.

There is a request from 2012 for an explicit format keyword argument, which would theoretically give you the functionality you are looking for with the performance you desire. In lieu of that being implemented, your best bet is to go with Wes's suggested approach in the link and simply read the date in as a string, calling pd.to_datetime after the fact

Here are sample times from my machine to illustrate:

%timeit read_test()
15.4 ms ± 96.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit read_test(infer_datetime_format=True)
17.2 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit read_test(date_parser=lambda dt: pd.datetime.strptime(dt, '%Y-%m-%d'))
147 ms ± 4.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df = read_test(); df['index'] = pd.to_datetime(df['index'], '%Y-%m-%d')
15.3 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 22
Will Ayd Avatar answered Nov 04 '22 11:11

Will Ayd