I'm reading a huge CSV with a date field in the format YYYYMMDD, and I'm using the following lambda to convert it while reading:
```python
import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t), format='%Y%m%d', errors='coerce'))
```
This function is very slow though. Any suggestions to improve it?
Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25, thanks to the new `cache_dates` argument, which defaults to `True`.
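As a hedged sketch of that note: with pandas >= 0.25, `read_csv` caches the parsed result for each unique date string (`cache_dates=True` is the default), so repeated dates are only parsed once even without a custom parser. The inline CSV below is illustrative, standing in for the real file:

```python
import io
import pandas as pd

# Illustrative data: many repeated YYYYMMDD dates, as in the real file.
csv_data = io.StringIO("20190101,a\n20190101,b\n20190102,c\n")

# cache_dates=True (the default since 0.25) parses each unique date once.
df = pd.read_csv(csv_data, header=None, parse_dates=[0], cache_dates=True)

print(df[0].dtype)  # the column comes back as datetime64
```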
Try using this function for parsing dates:
```python
import pandas as pd

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format)
             for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
```
Use it like:
```python
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
```
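Note that since pandas 0.23, `pd.to_datetime` itself has a `cache` parameter that applies the same unique-then-map idea internally, so a hedged one-line alternative (not part of the benchmark below) would be:

```python
import pandas as pd

# Illustrative series with heavily repeated dates.
s = pd.Series(["20190101", "20190101", "20190102"] * 1000)

# cache=True parses each unique string once, then maps the results back.
parsed = pd.to_datetime(s, format="%Y%m%d", cache=True)
```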
Benchmarks:
```
$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms
```
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse