Parallelize pandas apply

Tags: python, pandas, parallel-processing, apply, embarrassingly-parallel

New to pandas, I already want to parallelize a row-wise apply operation. So far I have found "Parallelize apply after pandas groupby". However, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the date in each row I want to find the number of days to the nearest holiday before it and to the nearest holiday after it.

This is the function I call via apply:

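import numpy as np

def get_nearest_holiday(x, pivot):
    # x: the candidate holiday dates; pivot: the date of the current row
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    difference = abs(nearestHoliday - pivot)
    # convert the timedelta to a (float) number of days
    return difference / np.timedelta64(1, 'D')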
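To make the row-wise pattern concrete, here is a minimal sketch of how such a function could be applied to a small made-up data set. The names `holidays`, `dates`, `days_before` and `days_after` and the sample dates are assumptions for illustration only, not the real data.

```python
import numpy as np
import pandas as pd

def get_nearest_holiday(x, pivot):
    # x: candidate holiday dates; pivot: the date of the current row
    nearest_holiday = min(x, key=lambda d: abs(d - pivot))
    difference = abs(nearest_holiday - pivot)
    # convert the timedelta to a float number of days
    return difference / np.timedelta64(1, 'D')

# Hypothetical sample data (not from the original question)
holidays = pd.Series(pd.to_datetime(["2016-01-01", "2016-12-25"]))
dates = pd.Series(pd.to_datetime(["2016-01-05", "2016-12-20"]))

# For every date, scan all holidays strictly before / strictly after it.
days_before = dates.apply(lambda d: get_nearest_holiday(holidays[holidays < d], d))
days_after = dates.apply(lambda d: get_nearest_holiday(holidays[holidays > d], d))
```

Every row filters and scans the whole holiday list, so the cost grows with the number of rows times the number of holidays. Note also that `min` raises a `ValueError` when a date has no holiday before (or after) it.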

How can I speed it up?

edit

I experimented a bit with Python's multiprocessing pools, but the code was not nice and I did not get my computed results back.

asked Sep 02 '16 05:09

Georg Heiler

1 Answer

For the parallel approach, this is the answer, based on "Parallelize apply after pandas groupby":

from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

# get_nearest_date is presumably the row-wise helper from the question
# (named get_nearest_holiday there); holidays and datesFrame already exist.
def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    # run func on every group in its own joblib task, one worker per CPU core
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print('parallel version: ')
# took 4 min 30 seconds (%time is an IPython/Jupyter magic)
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)

However, I prefer @NinjaPuppy's approach because it does not require O(n * number_of_holidays) work.
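One way to avoid the O(n * number_of_holidays) scan is a binary search over the sorted holidays. The sketch below is only an illustration of that idea and not necessarily what @NinjaPuppy's answer does. The function name `days_to_adjacent_holidays` and the sample dates are made up, and dates are assumed to be plain calendar days.

```python
import numpy as np
import pandas as pd

def days_to_adjacent_holidays(dates, holidays):
    """Return (days_before, days_after) as float arrays.

    An entry is NaN where no holiday lies strictly before / after the date.
    """
    hol = np.sort(pd.to_datetime(holidays).values.astype("datetime64[D]"))
    d = pd.to_datetime(dates).values.astype("datetime64[D]")

    # Index of the last holiday strictly before each date (-1 if none).
    idx_prev = np.searchsorted(hol, d, side="left") - 1
    # Index of the first holiday strictly after each date (len(hol) if none).
    idx_next = np.searchsorted(hol, d, side="right")

    has_prev = idx_prev >= 0
    has_next = idx_next < len(hol)

    one_day = np.timedelta64(1, "D")
    before = np.full(len(d), np.nan)
    after = np.full(len(d), np.nan)
    before[has_prev] = (d[has_prev] - hol[idx_prev[has_prev]]) / one_day
    after[has_next] = (hol[idx_next[has_next]] - d[has_next]) / one_day
    return before, after

# Hypothetical sample data
sample_dates = ["2016-01-05", "2016-12-20", "2015-12-31"]
sample_holidays = ["2016-12-25", "2016-01-01"]
before, after = days_to_adjacent_holidays(sample_dates, sample_holidays)
```

Sorting the holidays once and searching per date costs roughly O((n + h) log h) for n dates and h holidays, and the whole computation stays vectorized in NumPy, so it may remove the need for parallelism altogether.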

answered Sep 21 '22 13:09

Georg Heiler
