How to use groupby to apply multiple functions to multiple columns in Pandas?

I have an ordinary DataFrame:

import pandas as pd
A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

Following this recipe, I got the results I wanted.

In [62]: A.groupby((A['A'] > 2)).apply(lambda x: pd.Series(dict(
                   up_B=(x.B >= 0).sum(), down_B=(x.B < 0).sum(), mean_B=(x.B).mean(), std_B=(x.B).std(),
                   up_C=(x.C >= 0).sum(), down_C=(x.C < 0).sum(), mean_C=(x.C).mean(), std_C=(x.C).std())))

Out[62]:
       down_B  down_C  mean_B    mean_C     std_B     std_C  up_B  up_C
A                                                                      
False       0       0     4.5  3.000000  0.707107  1.414214     2     2
True        0       0     2.0  2.333333  1.000000  1.527525     3     3

This approach works, but for a large number of columns (15-100) you would have to type every expression into the formula by hand, which is cumbersome.

Given that the same formulas are applied to ALL columns, is there an efficient way to do this for a large number of columns?

Thanks

hernanavella asked Oct 05 '14 19:10



1 Answer

Since you are aggregating each grouped column into one value, you can use agg instead of apply. The agg method can take a list of functions as input. The functions will be applied to each column:

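# Count the non-negative values in a column
def up(x):
    return (x >= 0).sum()

# Count the negative values in a column
def down(x):
    return (x < 0).sum()

# Select columns B through C, group by whether A > 2,
# and apply every function in the list to each column
result = A.loc[:, 'B':'C'].groupby((A['A'] > 2)).agg(
             [up, down, 'mean', 'std'])
print(result)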

yields

       B                      C                         
      up down mean       std up down      mean       std
A                                                       
False  2    0  4.5  0.707107  2    0  3.000000  1.414214
True   3    0  2.0  1.000000  3    0  2.333333  1.527525
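
The same call should scale to the 15-100 column case from the question without any extra typing. Below is a minimal sketch under stated assumptions: the frame `wide`, its column names `col0` ... `col19`, and the random data are hypothetical and only stand in for a real wide table. Instead of a label slice, it drops the key column and aggregates everything else.

```python
import numpy as np
import pandas as pd

def up(x):
    # Count the non-negative values in a column
    return (x >= 0).sum()

def down(x):
    # Count the negative values in a column
    return (x < 0).sum()

# Hypothetical wide frame: a key column 'A' plus 20 value columns
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(50, 20)),
                    columns=[f"col{i}" for i in range(20)])
wide.insert(0, "A", np.arange(1, 51))

# Build the grouping key first, then aggregate every column except 'A'
key = wide["A"] > 2
values = wide.drop(columns="A")
wide_result = values.groupby(key).agg([up, down, "mean", "std"])

# Two groups (False/True), 20 columns x 4 statistics
print(wide_result.shape)
```

Dropping the key column rather than slicing by label avoids depending on column order, which may matter when the value columns are not contiguous.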

result has hierarchical ("MultiIndexed") columns. To select a certain column (or columns), you could use:

In [39]: result['B','mean']
Out[39]: 
A
False    4.5
True     2.0
Name: (B, mean), dtype: float64

In [46]: result[[('B', 'mean'), ('C', 'mean')]]
Out[46]: 
         B         C
      mean      mean
A                   
False  4.5  3.000000
True   2.0  2.333333
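
With many columns, listing every `(column, 'mean')` tuple gets tedious. One option, sketched here as a suggestion rather than as part of the original answer, is `DataFrame.xs`, which takes a cross-section on the second column level and returns one statistic for all columns at once. The variable name `means` is only illustrative.

```python
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

def up(x):
    # Count the non-negative values in a column
    return (x >= 0).sum()

def down(x):
    # Count the negative values in a column
    return (x < 0).sum()

result = A.loc[:, 'B':'C'].groupby(A['A'] > 2).agg([up, down, 'mean', 'std'])

# Take the 'mean' entry from level 1 of the column MultiIndex;
# the selected level is dropped, leaving plain columns B and C
means = result.xs('mean', axis=1, level=1)
print(means)
```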

or you could move one level of the MultiIndex to the index:

In [40]: result.stack()
Out[40]: 
                   B         C
A                             
False up    2.000000  2.000000
      down  0.000000  0.000000
      mean  4.500000  3.000000
      std   0.707107  1.414214
True  up    3.000000  3.000000
      down  0.000000  0.000000
      mean  2.000000  2.333333
      std   1.000000  1.527525
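
If you prefer the flat column names from the question (`up_B`, `mean_C`, and so on), the two column levels can be joined into single strings. This is a minimal sketch, not something the original answer shows; the name `flat` and the `function_column` naming pattern are assumptions chosen to match the question's output.

```python
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

def up(x):
    # Count the non-negative values in a column
    return (x >= 0).sum()

def down(x):
    # Count the negative values in a column
    return (x < 0).sum()

result = A.loc[:, 'B':'C'].groupby(A['A'] > 2).agg([up, down, 'mean', 'std'])

# Each column label is a (column, function) tuple; join it as "function_column"
flat = result.copy()
flat.columns = [f"{func}_{col}" for col, func in flat.columns]
print(flat.columns.tolist())
```

Unlike the `apply` version in the question, the columns stay in the order B then C instead of being sorted alphabetically.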
unutbu answered Nov 02 '22 04:11