Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parallelize pandas apply

New to pandas, I already want to parallelize a row-wise apply operation. So far I found Parallelize apply after pandas groupby However, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays and for my current row/date want to find the no-of-days before and after this day to the next holiday.

This is the function I call via apply:

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda x: abs(x- pivot))
    difference = abs(nearesHoliday - pivot)
    return difference / np.timedelta64(1, 'D')

How can I speed it up?

edit

I experimented a bit with pythons pools - but it was neither nice code, nor did I get my computed results.

like image 997
Georg Heiler Avatar asked Sep 02 '16 05:09

Georg Heiler


People also ask

Can you parallelize pandas?

Using decomposition rules, Modin can efficiently leverage multiple cores on a machine or on a cluster to parallelize all pandas operations – leading to multiple orders of magnitude speedup.

Does pandas apply use multiple cores?

Operations on data frame using Pandas is slow, as it uses a single-core of CPU to perform the computations, and does not take advantage of a multi-core CPU. The modern laptop has a lot of cores that can be utilized parallelly to scale up the performance.

Is pandas apply faster than for loop?

As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead. Using map as a vectorized solution gives even faster results.

What is the difference between apply and Applymap in pandas?

apply() is used to apply a function along an axis of the DataFrame or on values of Series. applymap() is used to apply a function to a DataFrame elementwise. map() is used to substitute each value in a Series with another value.


1 Answers

For the parallel approach this is the answer based on Parallelize apply after pandas groupby:

from joblib import Parallel, delayed
import multiprocessing

def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday']  =  df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print ('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)

but I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays)

like image 163
Georg Heiler Avatar answered Sep 21 '22 13:09

Georg Heiler