Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas multiprocessing apply

I'm trying to use multiprocessing with pandas dataframe, that is split the dataframe to 8 parts. apply some function to each part using apply (with each part processed in different process).

EDIT: Here's the solution I finally found:

import multiprocessing as mp import pandas.util.testing as pdt  def process_apply(x):     # do some stuff to data here  def process(df):     res = df.apply(process_apply, axis=1)     return res  if __name__ == '__main__':     p = mp.Pool(processes=8)     split_dfs = np.array_split(big_df,8)     pool_results = p.map(aoi_proc, split_dfs)     p.close()     p.join()      # merging parts processed by different processes     parts = pd.concat(pool_results, axis=0)      # merging newly calculated parts to big_df     big_df = pd.concat([big_df, parts], axis=1)      # checking if the dfs were merged correctly     pdt.assert_series_equal(parts['id'], big_df['id']) 
like image 816
yemu Avatar asked Nov 06 '14 16:11

yemu


1 Answers

You can use https://github.com/nalepae/pandarallel, as in the following example:

from pandarallel import pandarallel from math import sin  pandarallel.initialize()  def func(x):     return sin(x**2)  df.parallel_apply(func, axis=1)  
like image 158
Sébastien Vincent Avatar answered Oct 03 '22 07:10

Sébastien Vincent