I have an ordinary DataFrame:
import pandas as pd
A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])
Following this recipe, I got the results I wanted.
In [62]: A.groupby((A['A'] > 2)).apply(lambda x: pd.Series(dict(
    up_B=(x.B >= 0).sum(), down_B=(x.B < 0).sum(), mean_B=(x.B).mean(), std_B=(x.B).std(),
    up_C=(x.C >= 0).sum(), down_C=(x.C < 0).sum(), mean_C=(x.C).mean(), std_C=(x.C).std())))
Out[62]:
       down_B  down_C  mean_B    mean_C     std_B     std_C  up_B  up_C
A
False       0       0     4.5  3.000000  0.707107  1.414214     2     2
True        0       0     2.0  2.333333  1.000000  1.527525     3     3
This approach works, but with a large number of columns (15-100) you would have to type every one of those expressions into the formula, which is cumbersome.
Given that the same formulas are applied to ALL columns, is there an efficient way to do this for a large number of columns?
Thanks
Grouping by multiple columns: you can do this by passing a list of column names to groupby instead of a single string value.
Use DataFrame.groupby().sum() to group rows by one or more columns and sum each group. groupby() returns a DataFrameGroupBy object, and its sum() method computes the sum of each column within each group.
The "group by" process is split-apply-combine: (1) splitting the data into groups, (2) applying a function to each group independently, and (3) combining the results into a data structure.
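To make the three steps concrete, here is a minimal sketch that spells them out by hand on the DataFrame from the question and compares the result with a plain groupby call. The names key, pieces, means, combined and direct are only illustrative.

```python
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])
key = A['A'] > 2

# (1) Split: one sub-DataFrame per group label (False / True).
pieces = {label: group for label, group in A.groupby(key)}

# (2) Apply: compute the column means of B and C for each piece independently.
means = {label: group[['B', 'C']].mean() for label, group in pieces.items()}

# (3) Combine: assemble the per-group results into one DataFrame,
# with one row per group label.
combined = pd.DataFrame(means).T

# The same computation in a single groupby call.
direct = A.groupby(key)[['B', 'C']].mean()
```

Both combined and direct should hold the same group means; the manual version is only there to show what groupby does behind the scenes.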
Since you are aggregating each grouped column into one value, you can use agg instead of apply. The agg method can take a list of functions as input. The functions will be applied to each column:
def up(x):
    return (x >= 0).sum()
def down(x):
    return (x < 0).sum()
result = A.loc[:, 'B':'C'].groupby((A['A'] > 2)).agg(
    [up, down, 'mean', 'std'])
print(result)
yields
        B                         C
       up  down  mean       std  up  down      mean       std
A
False   2     0   4.5  0.707107   2     0  3.000000  1.414214
True    3     0   2.0  1.000000   3     0  2.333333  1.527525
result has hierarchical ("MultiIndexed") columns. To select a certain column (or columns), you could use:
In [39]: result['B','mean']
Out[39]:
A
False    4.5
True     2.0
Name: (B, mean), dtype: float64
In [46]: result[[('B', 'mean'), ('C', 'mean')]]
Out[46]:
          B         C
       mean      mean
A
False   4.5  3.000000
True    2.0  2.333333
or you could move one level of the MultiIndex to the index:
In [40]: result.stack()
Out[40]:
                   B         C
A
False up    2.000000  2.000000
      down  0.000000  0.000000
      mean  4.500000  3.000000
      std   0.707107  1.414214
True  up    3.000000  3.000000
      down  0.000000  0.000000
      mean  2.000000  2.333333
      std   1.000000  1.527525
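With 15-100 columns, two other options may be convenient: taking one statistic across every column with xs, and flattening the MultiIndex into single-level names like up_B, matching the layout in the question. This is a minimal sketch that rebuilds A and result as above; the names means_only and flat are only illustrative.

```python
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

def up(x):
    return (x >= 0).sum()

def down(x):
    return (x < 0).sum()

result = A.loc[:, 'B':'C'].groupby(A['A'] > 2).agg([up, down, 'mean', 'std'])

# Cross-section on the second column level: the 'mean' of every column.
# The selected level is dropped, so the columns are just B and C.
means_only = result.xs('mean', axis=1, level=1)

# Flatten the two-level columns into names such as 'up_B' or 'std_C'.
flat = result.copy()
flat.columns = [f"{func}_{col}" for col, func in result.columns]
```

Here flat keeps the column order of result (all statistics for B, then all for C) rather than the alphabetical order shown in the question's output.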