I'm reading a huge CSV with a date field in the format YYYYMMDD, and I'm using the following lambda to convert it while reading:
```python
import pandas as pd

df = pd.read_csv(filen,
                 index_col=None,
                 header=None,
                 parse_dates=[0],
                 date_parser=lambda t: pd.to_datetime(str(t), format='%Y%m%d', errors='coerce'))
```
This function is very slow though. Any suggestions to improve it?
Note: As @ritchie46's answer states, this solution may be redundant since pandas version 0.25, thanks to the new `cache_dates` argument, which defaults to `True`.
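As a hedged sketch of that note: with pandas >= 0.25, `read_csv` caches the parsed result for each unique date string (`cache_dates=True` is the default), so repeated dates are only parsed once even without a custom parser. The inline CSV below is illustrative, standing in for the real file:

```python
import io
import pandas as pd

# Illustrative data: many repeated YYYYMMDD dates, as in the real file.
csv_data = io.StringIO("20190101,a\n20190101,b\n20190102,c\n")

# cache_dates=True (the default since 0.25) parses each unique date once.
df = pd.read_csv(csv_data, header=None, parse_dates=[0], cache_dates=True)

print(df[0].dtype)  # the column comes back as datetime64
```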
Try using this function for parsing dates:
```python
import pandas as pd

def lookup(date_pd_series, format=None):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date, format=format)
             for date in date_pd_series.unique()}
    return date_pd_series.map(dates)
```
Use it like:
```python
df['date-column'] = lookup(df['date-column'], format='%Y%m%d')
```
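Note that since pandas 0.23, `pd.to_datetime` itself has a `cache` parameter that applies the same unique-then-map idea internally, so a hedged one-line alternative (not part of the benchmark below) would be:

```python
import pandas as pd

# Illustrative series with heavily repeated dates.
s = pd.Series(["20190101", "20190101", "20190102"] * 1000)

# cache=True parses each unique string once, then maps the results back.
parsed = pd.to_datetime(s, format="%Y%m%d", cache=True)
```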
Benchmarks:
```
$ python date-parse.py
to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms
```
Source: https://github.com/sanand0/benchmarks/tree/master/date-parse