Is it feasible to run multiple group-wise calculations on a pandas DataFrame concurrently and get the results back? I'd like to compute the following DataFrames, collect each result one by one, and finally merge them into a single DataFrame.
df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
df_c = df.groupby(["state", "person"]).apply(lambda x: x["number"].sum())
And then,
df_final = pd.merge(df_a, df_b) # omitting the irrelevant part
However, as far as I know, the functionality in multiprocessing doesn't fit my needs here. It seems geared toward either concurrently running multiple functions that don't return their internally created local variables and instead just print some output within the function (e.g. the oft-used is_prime function), or concurrently running a single function with different sets of arguments (e.g. the map function in multiprocessing), if I understand it correctly (I'm not sure I do, so correct me if I'm wrong!).
What I'd like to implement instead is to just run those three computations (and, in practice, more) simultaneously, and merge them together once all of the computations on the DataFrame have completed successfully. I have in mind the kind of functionality implemented in Go (goroutines and channels): creating each function separately, kicking them all off concurrently, waiting for all of them to complete, and finally merging the results together.
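For what it's worth, that "start several tasks, wait for all, merge" shape maps fairly directly onto concurrent.futures. Here is a sketch with made-up toy data (function and column names are just illustrative); note that because of the GIL, threads only help if the underlying work releases it, so for heavy CPU-bound work you would swap in ProcessPoolExecutor:

```python
import concurrent.futures

import pandas as pd

# toy data; column names follow the question
df = pd.DataFrame({
    "state":  ["NY", "NY", "CA", "CA"],
    "person": ["a", "b", "a", "b"],
    "height": [170, 180, 160, 175],
    "weight": [60, 80, 55, 70],
    "number": [1, 2, 3, 4],
})

def compute_height(d):
    return d.groupby(["state", "person"])["height"].mean()

def compute_weight(d):
    return d.groupby(["state", "person"])["weight"].mean()

def compute_number(d):
    return d.groupby(["state", "person"])["number"].sum()

with concurrent.futures.ThreadPoolExecutor() as ex:
    # submit() is the goroutine-like "fire it off" step...
    futures = [ex.submit(f, df)
               for f in (compute_height, compute_weight, compute_number)]
    # ...and result() is the channel-like "wait and collect" step
    results = [fut.result() for fut in futures]

# all three results share the same (state, person) index, so concat merges them
df_final = pd.concat(results, axis=1)
print(df_final)
```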
So how can this be written in Python? I've read the documentation for multiprocessing, threading, and concurrent.futures, but all of them are too elusive for me; I can't even tell whether I can use those libraries to begin with...
(I simplified the code for the purpose of brevity; the actual code is more complicated, so please don't answer "Yeah, you can write it in one line in a non-concurrent way" or something like that.)
Thanks.
9 months later and this is still one of the top results for working with multiprocessing and pandas. I hope you've found some kind of answer by now, but if not, I've got one that seems to work, and hopefully it will help others who view this question.
import pandas as pd
import numpy as np
#sample data
df = pd.DataFrame([[1,2,3,1,2,3,1,2,3,1],[2,2,2,2,2,2,2,2,2,2],[1,3,5,7,9,2,4,6,8,0],[2,4,6,8,0,1,3,5,7,9]]).transpose()
df.columns=['a','b','c','d']
df
a b c d
0 1 2 1 2
1 2 2 3 4
2 3 2 5 6
3 1 2 7 8
4 2 2 9 0
5 3 2 2 1
6 1 2 4 3
7 2 2 6 5
8 3 2 8 7
9 1 2 0 9
#this one function does the three calculations from your question; you could add
#more functions, or different ones, for different groupby statistics
def f(x):
    #x is a (key, sub-DataFrame) pair yielded by the groupby
    return [np.mean(x[1]['c']), np.mean(x[1]['d']), x[1]['d'].sum()]
#set up a pool of 4 worker processes
from multiprocessing import Pool
pool = Pool(4)
#runs the statistics you wanted on each group
group_df = pd.DataFrame(pool.map(f,df.groupby(['a','b'])))
group_df
0 1 2
0 3 5.500000 22
1 6 3.000000 9
2 5 4.666667 14
group_df['keys'] = [key for key, _ in df.groupby(['a','b'])] #same iteration order pool.map saw, so the labels line up with the rows
group_df
   0         1   2    keys
0  3  5.500000  22  (1, 2)
1  6  3.000000   9  (2, 2)
2  5  4.666667  14  (3, 2)
At the least, I hope this helps someone who's looking at this stuff in the future.