GroupBy operation using an entire dataframe to group values

Tags: python, pandas, group-by, pandas-groupby

I have two DataFrames like this:

import numpy as np
import pandas as pd

np.random.seed(0)
a = pd.DataFrame(np.random.randn(20, 3))
b = pd.DataFrame(np.random.randint(1, 5, size=(20, 3)))

I'd like to find the average of the values in a for each of the 4 groups in b, where each cell of b gives the group of the cell at the same position in a.

This...

a[b==1].sum().sum() / a[b==1].count().sum()

...works for doing one group at a time, but I was wondering if anyone could think of a cleaner method.

My expected result is:

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64

Thanks.

asked Jun 06 '19 15:06

MJS

2 Answers

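You can stack both DataFrames and then group the stacked values of a by the stacked labels of b. Both stacked Series share the same (row, column) index, so each value lines up with its group label: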

a.stack().groupby(b.stack()).mean()
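
To see why the alignment works, here is a minimal sketch on toy 2x2 frames. The names small_a and small_b and their values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to check by hand.
small_a = pd.DataFrame([[1.0, 2.0], [3.0, 6.0]])
small_b = pd.DataFrame([[1, 2], [2, 1]])

values = small_a.stack()   # Series indexed by (row, column)
labels = small_b.stack()   # same index, holding the group label of each cell

# Group the values by the aligned labels and average each group.
means = values.groupby(labels).mean()
print(means)
```

Group 1 covers the cells holding 1.0 and 6.0 (mean 3.5), and group 2 covers 2.0 and 3.0 (mean 2.5).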

answered Sep 19 '22 22:09

BENY

If you want a fast numpy solution, use np.unique and np.bincount:

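# Flatten both frames to 1-D arrays: c holds the values, d the group labels
c, d = (df.to_numpy().ravel() for df in [a, b])
# u: sorted unique labels, i: position of each label in u, cnt: size of each group
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)

# Sum the values per group (bincount with weights), then divide by the group sizes
np.bincount(i, c) / cnt
# array([-0.0887145 , -0.34004319, -0.04559595,  0.58213553])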
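
As a rough illustration of what each array holds, here is the same idea on a tiny made-up input. The names vals and labs and their contents are assumptions for this sketch, standing in for c and d.

```python
import numpy as np

# Hypothetical toy inputs: `vals` plays the role of `c`, `labs` the role of `d`.
vals = np.array([1.0, 2.0, 3.0, 6.0])
labs = np.array([10, 20, 20, 10])

# uniq: sorted distinct labels; inv: position of each label within uniq;
# counts: how many times each distinct label occurs.
uniq, inv, counts = np.unique(labs, return_inverse=True, return_counts=True)

# bincount with weights adds up `vals` per position in `inv`, i.e. per group.
sums = np.bincount(inv, weights=vals)
group_means = sums / counts

print(uniq)         # [10 20]
print(group_means)  # [3.5 2.5]
```

Because inv maps arbitrary labels to 0..k-1, this works even when the labels are not small non-negative integers, which np.bincount alone would require.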

To get the result as a Series indexed by group label, use:

pd.Series(np.bincount(i, c) / cnt, index=u)

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64

For comparison, the stack approach returns the same result:

a.stack().groupby(b.stack()).mean()

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64
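
One caveat that may be worth checking on your own data: if a contains NaN, np.bincount adds the NaN into that group's total, whereas the sum()/count() approach from the question skips missing values. A minimal sketch of one way to skip them, assuming that is the behaviour you want (the helper name group_nanmean is hypothetical):

```python
import numpy as np

def group_nanmean(values, labels):
    """Per-label mean of `values`, ignoring NaN entries (hypothetical helper)."""
    values = np.asarray(values, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    keep = ~np.isnan(values)             # drop cells with missing values
    uniq, inv = np.unique(labels[keep], return_inverse=True)
    sums = np.bincount(inv, weights=values[keep])
    counts = np.bincount(inv)            # non-NaN cells per label
    return uniq, sums / counts

# Toy input: the NaN cell belongs to group 1 and is left out of its mean.
uniq, means = group_nanmean([[1.0, np.nan], [4.0, 5.0]], [[1, 1], [2, 1]])
print(uniq, means)
```

Note that a label whose cells are all NaN does not appear in the output of this sketch at all.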

Timings (IPython, on the 20x3 frames above):

%timeit a.stack().groupby(b.stack()).mean()

5.16 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
c, d = (df.to_numpy().ravel() for df in [a, b])
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
np.bincount(i, c) / cnt

113 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
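
The %timeit magics only work in IPython or Jupyter. Outside of those, a rough equivalent with the standard timeit module might look like the sketch below. The function names and the repetition count are arbitrary choices for illustration, and the numbers will vary with machine, library versions, and data size, so it is worth re-measuring on frames of your real size.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
a = pd.DataFrame(np.random.randn(20, 3))
b = pd.DataFrame(np.random.randint(1, 5, size=(20, 3)))

def with_stack():
    return a.stack().groupby(b.stack()).mean()

def with_bincount():
    c, d = (df.to_numpy().ravel() for df in (a, b))
    u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
    return pd.Series(np.bincount(i, c) / cnt, index=u)

# Sanity check: both approaches agree before comparing speed.
assert np.allclose(with_stack().to_numpy(), with_bincount().to_numpy())

n = 200  # arbitrary repetition count
for fn in (with_stack, with_bincount):
    seconds = timeit.timeit(fn, number=n)
    print(f"{fn.__name__}: {seconds / n * 1e6:.1f} µs per call")
```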

answered Sep 18 '22 22:09

cs95
