
Pandas to_datetime() function performance issues

Tags:

python

pandas

I have a df like this:

Dat
10/01/2016
11/01/2014
12/02/2013

The column 'Dat' has object dtype, so I am trying to convert it to datetime using the pandas to_datetime() function like this:

from functools import partial
import pandas as pd
to_datetime_rand = partial(pd.to_datetime, format='%m/%d/%Y')
df['Dat'] = df['Dat'].apply(to_datetime_rand)

Everything works, but I run into performance issues when my df has more than 2 billion rows. In that case this method gets stuck and does not work well.

Can the pandas to_datetime() function do the conversion in chunks, or iteratively in a loop?

Thanks.

Keithx asked Dec 02 '22 13:12


1 Answer

If performance is a concern, I would advise using the following function to convert those columns to datetime:

import pandas as pd

def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    # Parse each distinct date string exactly once.
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    # Replace every value with its already-parsed Timestamp.
    return s.map(dates)

Reported timings for each parsing approach:

to_datetime: 5799 ms
dateutil:    5162 ms
strptime:    1651 ms
manual:       242 ms
lookup:        32 ms
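
To tie this back to the question, here is a minimal sketch, not a benchmarked solution. It shows three things: calling pd.to_datetime once on the whole column instead of per element through apply, a variant of the lookup idea that passes the question's format string, and a simple chunked conversion. The helper names lookup_with_format and convert_in_chunks, the chunk_size default, and the sample frame are assumptions made for illustration. Note that pd.to_datetime also has a cache parameter that reuses already-parsed unique values, so the gain from a manual lookup may vary with the pandas version and the data.

```python
import pandas as pd


def lookup_with_format(s, fmt="%m/%d/%Y"):
    """Parse each unique string once with an explicit format, then map back.

    Hypothetical helper: same idea as lookup(), but with the format given.
    """
    uniques = s.dropna().unique()
    parsed = pd.to_datetime(uniques, format=fmt)
    mapping = dict(zip(uniques, parsed))
    # Values missing from the mapping (e.g. NaN) come back as NaT.
    return s.map(mapping)


def convert_in_chunks(s, fmt="%m/%d/%Y", chunk_size=1_000_000):
    """Convert a Series slice by slice (hypothetical helper).

    chunk_size is an arbitrary illustrative default, not a tuned value.
    """
    if len(s) == 0:
        return pd.to_datetime(s, format=fmt)
    parts = [
        pd.to_datetime(s.iloc[start:start + chunk_size], format=fmt)
        for start in range(0, len(s), chunk_size)
    ]
    return pd.concat(parts)


# Sample data shaped like the question's column, with one repeated date.
df = pd.DataFrame(
    {"Dat": ["10/01/2016", "11/01/2014", "12/02/2013", "10/01/2016"]}
)

# Vectorized call on the whole column: no per-row Python function call.
vectorized = pd.to_datetime(df["Dat"], format="%m/%d/%Y")
via_lookup = lookup_with_format(df["Dat"])
via_chunks = convert_in_chunks(df["Dat"], chunk_size=2)
```

All three results should hold the same datetime values. Which one is fastest depends on how many distinct dates the column contains, so it is worth timing them on a sample of the real data first.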
SerialDev answered Dec 28 '22 15:12