
GroupBy operation using an entire dataframe to group values

I have 2 dataframes like this...

import numpy as np
import pandas as pd

np.random.seed(0)
a = pd.DataFrame(np.random.randn(20,3))
b = pd.DataFrame(np.random.randint(1,5,size=(20,3)))

I'd like to find the average of the values in a for each of the 4 groups defined by the values in b.

This...

a[b==1].sum().sum() / a[b==1].count().sum()

...works for doing one group at a time, but I was wondering if anyone could think of a cleaner method.

My expected result is

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64

Thanks.

asked Jun 06 '19 15:06 by MJS



2 Answers

You can stack both DataFrames and then group one stacked Series by the other:

a.stack().groupby(b.stack()).mean()
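As a rough illustration of why this works, here is a minimal sketch on hypothetical 2x2 frames (a_small and b_small are made up for this example, not the question's data). stack() turns each DataFrame into a Series indexed by (row, column), so the stacked labels line up cell for cell with the stacked values and can serve as the group key.

```python
import pandas as pd

# Hypothetical toy data, not from the question: values and their group labels
a_small = pd.DataFrame([[1.0, 2.0], [3.0, 6.0]])
b_small = pd.DataFrame([[1, 2], [2, 1]])

# Each stacked Series is indexed by (row, column), so the two align cell by cell
values = a_small.stack()
labels = b_small.stack()

# Group every value by the label stored at the same (row, column) position
group_means = values.groupby(labels).mean()
print(group_means)
```

Here group 1 collects 1.0 and 6.0, and group 2 collects 2.0 and 3.0.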
answered Sep 19 '22 22:09 by BENY


If you want a fast numpy solution, use np.unique and np.bincount:

c, d = (a_.to_numpy().ravel() for a_ in [a, b]) 
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)

np.bincount(i, c) / cnt
# array([-0.0887145 , -0.34004319, -0.04559595,  0.58213553])
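To see what each step contributes, here is a small sketch on hypothetical 1-D arrays (vals and labs are invented for illustration). np.unique with return_inverse maps every label to the position of its group in the sorted unique labels, and np.bincount with weights then sums the values per position; dividing by the counts gives the means.

```python
import numpy as np

# Hypothetical toy data, not from the question
vals = np.array([1.0, 2.0, 3.0, 6.0])
labs = np.array([1, 2, 2, 1])

# u: sorted unique labels, inv: group position of each element, cnt: group sizes
u, inv, cnt = np.unique(labs, return_inverse=True, return_counts=True)

# bincount with weights adds up the values that share the same group position
sums = np.bincount(inv, weights=vals)

means = sums / cnt
print(u, inv, cnt, sums, means)
```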

To construct a Series, use

pd.Series(np.bincount(i, c) / cnt, index=u)

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64

For comparison, the stack approach returns:

a.stack().groupby(b.stack()).mean()

1   -0.088715
2   -0.340043
3   -0.045596
4    0.582136
dtype: float64
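If you would rather check the agreement programmatically than by eye, one possible sketch is below. It rebuilds the question's data and compares the two results with np.allclose; the names stacked, binned, same_labels and same_values are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
np.random.seed(0)
a = pd.DataFrame(np.random.randn(20, 3))
b = pd.DataFrame(np.random.randint(1, 5, size=(20, 3)))

# pandas approach: group stacked values by stacked labels
stacked = a.stack().groupby(b.stack()).mean()

# NumPy approach: per-group sums via bincount, divided by group sizes
c, d = a.to_numpy().ravel(), b.to_numpy().ravel()
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
binned = pd.Series(np.bincount(i, c) / cnt, index=u)

# Compare group labels and group means
same_labels = list(stacked.index) == list(binned.index)
same_values = np.allclose(stacked.to_numpy(), binned.to_numpy())
print(same_labels, same_values)
```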

%timeit a.stack().groupby(b.stack()).mean()
5.16 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
c, d = (a_.to_numpy().ravel() for a_ in [a, b])
u, i, cnt = np.unique(d, return_inverse=True, return_counts=True)
np.bincount(i, c) / cnt
113 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
answered Sep 18 '22 22:09 by cs95