
Compute on pandas dataframe concurrently

Is it feasible to do multiple group-wise calculation in dataframe in pandas concurrently and get those results back? So, I'd like to compute the following sets of dataframe and get those results one-by-one, and finally merge them into one dataframe.

df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
df_c = df.groupby(["state", "person"]).apply(lambda x: x["number"].sum())

And then,

df_final = pd.merge(df_a, df_b) # omitting the irrelevant part

However, as far as I know, the functionality in multiprocessing doesn't seem to fit my needs here. Its examples look more like either running multiple functions concurrently that don't return their internally-created local variables and instead just print some output inside the function (e.g. the oft-used is_prime function), or running a single function concurrently with different sets of arguments (e.g. the map function in multiprocessing) — if I understand it correctly. (I'm not sure I do, so correct me if I'm wrong!)

What I'd like to implement is to simply run those three computations (and, in practice, more) simultaneously, and merge the results once all of them have completed successfully. I have in mind the kind of functionality implemented in Go (goroutines and channels): define each function separately, launch them all concurrently, wait for all of them to complete, and finally merge the results together.

So how can this be written in Python? I've read the documentation for multiprocessing, threading, and concurrent.futures, but all of them are too elusive for me — I don't even understand whether those libraries apply to my problem in the first place...

(I made the code concise for the sake of brevity; the actual code is more complicated, so please don't answer "Yeah, you can write it in one line and in a non-concurrent way" or something like that.)

Thanks.

Blaszard asked Nov 08 '13

1 Answer

Nine months later, this is still one of the top results for working with multiprocessing and pandas. I hope you've found some kind of answer by now, but if not, here's one that seems to work, and hopefully it will help others who view this question.

import pandas as pd
import numpy as np
#sample data
df = pd.DataFrame([[1,2,3,1,2,3,1,2,3,1],[2,2,2,2,2,2,2,2,2,2],[1,3,5,7,9,2,4,6,8,0],[2,4,6,8,0,1,3,5,7,9]]).transpose()
df.columns=['a','b','c','d']
df

   a  b  c  d
0  1  2  1  2
1  2  2  3  4
2  3  2  5  6
3  1  2  7  8
4  2  2  9  0
5  3  2  2  1
6  1  2  4  3
7  2  2  6  5
8  3  2  8  7
9  1  2  0  9


#this one function does the three calculations you used in your question; obviously you could add more functions, or different ones for different groupbys
def f(x):
    # each item from the groupby is a (key, sub-frame) tuple, so x[1] is the group
    return [np.mean(x[1]['c']), np.mean(x[1]['d']), x[1]['d'].sum()]

#sets up a pool of 4 worker processes (on Windows, this and pool.map need to run under an if __name__ == '__main__': guard)
from multiprocessing import Pool
pool = Pool(4)

#runs the statistics you wanted on each group
group_df = pd.DataFrame(pool.map(f,df.groupby(['a','b'])))
group_df
   0         1   2
0  3  5.500000  22
1  6  3.000000   9
2  5  4.666667  14

#pool.map iterates the groupby in sorted key order, and .groups follows the same order
group_df['keys'] = list(df.groupby(['a','b']).groups.keys())

group_df
   0         1   2    keys
0  3  5.500000  22  (1, 2)
1  6  3.000000   9  (2, 2)
2  5  4.666667  14  (3, 2)

At the least, I hope this helps someone who's looking at this in the future.

James Tobin answered Sep 21 '22