Different groupers for each column with pandas GroupBy

How can I use a multidimensional grouper, in this case a second DataFrame, to group another DataFrame? Can it be done in one step?

The question is essentially about how to perform the grouping under these circumstances; to make it concrete, say I then want to transform and take the sum.

Consider for example:

df1 = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]})

print(df1)
   a  b
0  1  5
1  2  6
2  3  7
3  4  8

df2  = pd.DataFrame({'a':['A','B','A','B'], 'b':['A','A','B','B']})

print(df2)
   a  b
0  A  A
1  B  A
2  A  B
3  B  B

Then, the expected output would be:

   a  b
0  4  11
1  6  11
2  4  15
3  6  15

Where columns a and b in df1 have been grouped by columns a and b from df2 respectively.
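For reference, the expected frame can be reproduced column by column with an explicit loop; a minimal sketch of the intended semantics:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = pd.DataFrame({'a': ['A', 'B', 'A', 'B'], 'b': ['A', 'A', 'B', 'B']})

# Group each column of df1 by the matching column of df2, then
# broadcast the group sums back to the original row positions.
out = df1.copy()
for col in df1.columns:
    out[col] = df1[col].groupby(df2[col]).transform('sum')
print(out)
```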

— asked Jan 15 '19 by yatu

3 Answers

You will have to group each column individually since each column uses a different grouping scheme.

If you want a cleaner version, I would recommend a list comprehension over the column names, calling pd.concat on the resulting Series:

pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

   a   b
0  4  11
1  6  11
2  4  15
3  6  15

Not to say there's anything wrong with using apply as in the other answer, just that I don't like apply, so this is my suggestion :-)


Here are some timeits for your perusal. Even on your small sample data, the difference in timings is noticeable.

%%timeit 
(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum').unstack())
%%timeit 
df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))
%%timeit 
pd.concat([df1[c].groupby(df2[c]).transform('sum') for c in df1.columns], axis=1)

8.99 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.35 ms ± 859 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.13 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Not to say apply is slow, but explicit iteration is faster in this case. Additionally, the second and third solutions will scale better as the frame grows longer rather than wider, since the number of Python-level iterations depends on the number of columns, not the number of rows.
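To check the scaling claim yourself, here is a rough benchmark sketch; the shapes, seed, and column names are arbitrary choices for illustration, not from the original post:

```python
import numpy as np
import pandas as pd
import timeit

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 5
df1 = pd.DataFrame(rng.integers(0, 100, size=(n_rows, n_cols)),
                   columns=list('abcde'))
df2 = pd.DataFrame(rng.choice(list('AB'), size=(n_rows, n_cols)),
                   columns=list('abcde'))

def via_apply():
    return df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

def via_concat():
    return pd.concat([df1[c].groupby(df2[c]).transform('sum')
                      for c in df1.columns], axis=1)

# Both iterate once per column, so adding rows (not columns)
# barely changes the Python-level overhead.
print(timeit.timeit(via_apply, number=10))
print(timeit.timeit(via_concat, number=10))
```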

— answered Oct 17 '22 by cs95

Try using apply to run a lambda on each column of your dataframe; inside the lambda, each column arrives as a pd.Series whose name selects the matching grouping column from the second dataframe:

df1.apply(lambda x: x.groupby(df2[x.name]).transform('sum'))

Output:

   a   b
0  4  11
1  6  11
2  4  15
3  6  15
— answered Oct 17 '22 by Scott Boston

Using stack and unstack:

(df1.stack()
    .groupby([df2.stack().index.get_level_values(level=1), df2.stack()])
    .transform('sum')
    .unstack())
Out[291]: 
   a   b
0  4  11
1  6  11
2  4  15
3  6  15
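To see what the stacked groupby is doing, a sketch that breaks it into steps (same sample frames as the question):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = pd.DataFrame({'a': ['A', 'B', 'A', 'B'], 'b': ['A', 'A', 'B', 'B']})

# stack() turns each frame into a Series with a (row, column) MultiIndex,
# aligning the values of df1 with the labels of df2 position by position.
s1 = df1.stack()
s2 = df2.stack()

# Group by (column name, label) so that 'A' in column 'a' stays separate
# from 'A' in column 'b'; transform broadcasts the sums back, and
# unstack() restores the original rectangular shape.
result = (s1.groupby([s2.index.get_level_values(1), s2])
            .transform('sum')
            .unstack())
print(result)
```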
— answered Oct 17 '22 by BENY