 

How to improve the performance of a lambda function on a massive DataFrame

I have a DataFrame with hundreds of millions of rows.

     latitude                    longitude                   time                    VAL     
0   -39.20000076293945312500    140.80000305175781250000    1972-01-19 13:00:00     1.20000004768371582031  
1   -39.20000076293945312500    140.80000305175781250000    1972-01-20 13:00:00     0.89999997615814208984 
2   -39.20000076293945312500    140.80000305175781250000    1972-01-21 13:00:00     1.50000000000000000000 
3   -39.20000076293945312500    140.80000305175781250000    1972-01-22 13:00:00     1.60000002384185791016 
4   -39.20000076293945312500    140.80000305175781250000    1972-01-23 13:00:00     1.20000004768371582031
... ...

It contains a time column of type datetime64 in UTC. The following code creates a new column isInDST indicating whether each time falls within the daylight saving period of a local time zone.

df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

It takes about 400 seconds to process 15,223,160 rows.

Is there a faster way to achieve this? Is vectorization a better approach?

asked Apr 08 '21 by alextc


1 Answer

All results below are measured on 1M datapoints; the test data setup is shown at the end.

Cython + np.vectorize

7.2 times faster than the original code

%%cython
from cpython.datetime cimport datetime

# Compiled, typed predicate: True when the datetime's DST offset is non-zero
cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.08 s ± 10.2 ms per loop

np.vectorize

6.5 times faster than the original code

# Plain-Python predicate applied element-wise by np.vectorize
def is_in_dst(dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.2 s ± 29.3 ms per loop

Based on the documentation ("The implementation is essentially a for loop"), I expected this to perform the same as the list comprehension, but it is consistently a little faster.
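
As a rough mental model of that documentation note, np.vectorize behaves like a Python-level loop over the input array. A simplified sketch (the real implementation also handles broadcasting, multiple inputs, and output dtype inference):

import numpy as np

def vectorize_sketch(func):
    # Simplified: apply func element by element in a Python loop
    def wrapped(arr):
        return np.array([func(x) for x in arr])
    return wrapped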

List comprehension

5.9 times faster than the original code

%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]

1.33 s ± 48.4 ms per loop

This result shows that pandas map/apply is comparatively slow: it adds overhead that a plain Python loop avoids.

Original approach (map on pandas DatetimeIndex)

%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

7.82 s ± 84.3 ms per loop
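
For completeness, DST can also be detected without any per-element Python calls: DST is in effect exactly when a timestamp's local UTC offset differs from the zone's standard offset, which is UTC+10 for Australia/Victoria. A minimal vectorized sketch (not benchmarked here; it assumes the UTC+10 standard offset):

import pandas as pd

local = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')
# Local wall time minus UTC wall time = per-element UTC offset
offsets = local.tz_localize(None) - local.tz_convert('UTC').tz_localize(None)
# DST is in effect when the offset deviates from the standard UTC+10
df['isInDST'] = offsets != pd.Timedelta(hours=10)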

Tested on 1M rows of dummy data

import datetime, random
import pandas as pd

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0,23),minute=random.randint(0,59)) for _ in range(N)]})

I also ran the code on 100K and 10M rows; the results scale linearly with the number of rows.
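
A sketch of how such a scaling check might look (make_df is a hypothetical helper reusing the dummy-data scheme above; absolute timings will vary by machine):

import datetime, random
import timeit

import numpy as np
import pandas as pd

def is_in_dst(dt):
    return dt.dst().total_seconds() != 0

def make_df(n):
    # Same dummy-data scheme as above: today's date, random hour/minute
    times = [datetime.datetime.now().replace(hour=random.randint(0, 23),
                                             minute=random.randint(0, 59))
             for _ in range(n)]
    return pd.DataFrame({"time": times})

for n in (100_000, 1_000_000):  # add 10_000_000 for the full check
    df_n = make_df(n)
    t = timeit.timeit(
        lambda: np.vectorize(is_in_dst)(
            df_n['time'].dt.tz_localize('UTC')
                        .dt.tz_convert('Australia/Victoria')
                        .dt.to_pydatetime()
        ),
        number=3,
    ) / 3
    print(f"{n:>10,} rows: {t:.2f} s per run")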

answered Nov 10 '22 by PermanentPon