 

Parallelize apply after pandas groupby

I have used rosetta.parallel.pandas_easy to parallelize apply after groupby, for example:

```python
import numpy as np
import pandas as pd
from rosetta.parallel.pandas_easy import groupby_to_series_to_frame

df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
groupby_to_series_to_frame(df, np.mean, n_jobs=8, use_apply=True, by=df.index)
```

However, has anyone figured out how to parallelize a function that returns a DataFrame? This code fails for rosetta, as expected.

```python
def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

df.groupby(df.index).apply(tmpFunc)
groupby_to_series_to_frame(df, tmpFunc, n_jobs=1, use_apply=True, by=df.index)
```
asked Oct 03 '14 by Ivan

People also ask

Can pandas be parallelized?

TL;DR: Dask DataFrame can parallelize pandas apply() and map() operations, but it can do much more. With Dask's map_partitions(), you can work on each partition of your Dask DataFrame (each partition is itself a pandas DataFrame) while leveraging parallelism for various custom workflows.
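As a minimal sketch of the map_partitions() approach, assuming Dask is installed and reusing the question's toy DataFrame:

```python
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
# Split the pandas DataFrame into partitions; each partition is a pandas DataFrame
ddf = dd.from_pandas(df, npartitions=2)

def add_c(part):
    part = part.copy()
    part['c'] = part.a + part.b
    return part

# map_partitions applies add_c to every partition, potentially in parallel
result = ddf.map_partitions(add_c).compute()
print(result)
```

Note that Dask partitions by index ranges rather than by group, so this parallelizes partition-wise work; for strict group-wise semantics you still need a groupby, as in the answers below.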

Does Groupby preserve order python?

Groupby preserves the order of rows within each group. When calling apply, add group keys to index to identify pieces. Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
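A tiny illustration of that guarantee, with hypothetical data: within each group, rows keep the relative order they had in the original frame.

```python
import pandas as pd

df = pd.DataFrame({'k': ['b', 'a', 'b', 'a'], 'v': [1, 2, 3, 4]})
for name, group in df.groupby('k'):
    print(name, list(group.v))
# a [2, 4]
# b [1, 3]
```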

How does pandas Groupby apply work?

The function passed to apply must take a DataFrame as its first argument and return a DataFrame, a Series or a scalar. apply then takes care of combining the results back together into a single DataFrame or Series. apply is therefore a highly flexible grouping method.
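For instance, with the question's toy DataFrame, a scalar-returning function yields a Series while a DataFrame-returning function yields the chunks glued back together (illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])

# Scalar per group -> Series indexed by group label
print(df.groupby(df.index).apply(lambda g: g.a.sum()))

# DataFrame per group -> chunks combined back into a single DataFrame
print(df.groupby(df.index).apply(lambda g: g.assign(c=g.a + g.b)))
```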

Is pandas Groupby efficient?

Groupby is a very popular function in pandas. It is very good at summarising, transforming, filtering, and a few other essential data analysis tasks.


2 Answers

This seems to work, although it really should be built into pandas:

```python
import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def tmpFunc(df):
    df['c'] = df.a + df.b
    return df

def applyParallel(dfGrouped, func):
    # Run func on each group in a separate process, then glue the results together
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

if __name__ == '__main__':
    df = pd.DataFrame({'a': [6, 2, 2], 'b': [4, 5, 6]}, index=['g1', 'g1', 'g2'])
    print('parallel version: ')
    print(applyParallel(df.groupby(df.index), tmpFunc))

    print('regular version: ')
    print(df.groupby(df.index).apply(tmpFunc))

    print('ideal version (does not work): ')
    print(df.groupby(df.index).applyParallel(tmpFunc))
```
answered Sep 17 '22 by Ivan

Ivan's answer is great, but it looks like it can be slightly simplified, also removing the need to depend on joblib:

```python
from multiprocessing import Pool, cpu_count

import pandas

def applyParallel(dfGrouped, func):
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pandas.concat(ret_list)
```

By the way: this cannot replace every use of groupby.apply(), but it covers the typical cases: it should handle cases 2 and 3 in the documentation, while the behaviour of case 1 can be obtained by passing the argument axis=1 to the final pandas.concat() call (see the sketch below).
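Here is a minimal sketch of that case-1 variant, assuming the aggregation function reduces each chunk to a Series (applyParallelAgg is a hypothetical name, not part of the original answer):

```python
from multiprocessing import Pool, cpu_count

import pandas

def applyParallelAgg(dfGrouped, func):
    # func must be picklable (defined at module level) for multiprocessing
    names = [name for name, group in dfGrouped]
    with Pool(cpu_count()) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    # axis=1 puts one column per group; keys restores the group labels,
    # and .T moves them back onto the index, like groupby.apply's case-1 output
    return pandas.concat(ret_list, axis=1, keys=names).T
```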

EDIT: the docs have changed; the old version can be found here. In any case, I'm copying the three examples below.

case 1: group DataFrame, apply aggregation function (f(chunk) -> Series), yield DataFrame with group axis having group labels

case 2: group DataFrame, apply transform function (f(chunk) -> DataFrame with same indexes), yield DataFrame with resulting chunks glued together

case 3: group Series, apply function with f(chunk) -> DataFrame, yield DataFrame with result of chunks glued together
answered Sep 18 '22 by Pietro Battiston