Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to apply Pandas Groupby with multiple conditions for split and apply multiple calculations?

I have a dataframe that I need to split based on 2 conditions and this conditions applied to 3 different columns, then using those new splits perform some calculations on other columns in a pairwise manner.

So I have a df A

A = pd.DataFrame([[1, 5, 2, 3, 4, 1], [2, 4, 4, 5, 5, 1], [3, 3, 1, 2, 5, 4], [4, 2, 2, 3, 5, 5], [5, 1, 4, 2, 3, 1]],
                 columns=['A', 'B', 'C', 'D', 'E', 'F'], index=[1, 2, 3, 4, 5])

I need to split the dataframe based on 2 conditions and applied to the first three columns, like this"

condition1_colA = A[A['A'] > 1]
condition2_colA = A[A['A'] < 1]
condition1_colB = A[A['B'] > 1]
condition2_colB = A[A['B'] < 1]
condition1_colC = A[A['C'] > 1]
condition2_colC = A[A['C'] < 1]

Now, I need to do 3 calculations on each of 6 new sliced dataframes, however this calculations are paired up in the following way:

  • condition1_colA and condition2_colA with column['D']
  • condition1_colB and condition2_colB with column['E']
  • condition1_colC and condition2_colC with column['F']

Here would be the example for the first pair:

x = condition1_colA['D'].mean()
y = condition1_colA['D'].std()
z = condition1_colA['D'][condition1_colA['D'] > 0].count()
x1 = condition2_colA['D'].mean()
y1 = condition2_colA['D'].std()
z1 = condition2_colA['D'][condition1_colA['D'] < 0].count()

Then we would do the same for the other 2 pairs. And organize the 18 calculations on a datframe. I know how to do this manually, but I'm hoping for some help doing it with groupby or perhaps pivot table.

Thanks

like image 566
hernanavella Avatar asked Sep 29 '22 21:09

hernanavella


1 Answers

import pandas as pd

A = pd.DataFrame(
    [[1, 5, 2, 3, 4, 1], [2, 4, 4, 5, 5, 1], [3, 3, 1, 2, 5, 4],
     [4, 2, 2, 3, 5, 5], [5, 1, 4, 2, 3, 1]],
    columns=['A', 'B', 'C', 'D', 'E', 'F'], index=[1, 2, 3, 4, 5])

result = []
pairs = []
for col1, col2 in zip(list('ABC'), list('DEF')):
    pairs.append(col1+col2)
    mask = A[col1] > 1
    mask2 = mask & (A[col2] > 0)
    ser = A.loc[mask, col2]
    result.append((ser.mean(), ser.std(), mask2.sum()))

    mask = A[col1] < 1
    mask2 = mask & (A[col2] < 0)
    ser = A.loc[mask, col2]
    result.append((ser.mean(), ser.std(), mask2.sum()))

index = pd.MultiIndex.from_product([pairs, [1, 2]], names=['pair', 'cond'])
df = pd.DataFrame(result, index=index, columns=['mean', 'sum', 'count'])
print(df)

yields

           mean       sum  count
pair cond                       
AD   1     3.00  1.414214      4
     2      NaN       NaN      0
BE   1     4.75  0.500000      4
     2      NaN       NaN      0
CF   1     2.00  2.000000      4
     2      NaN       NaN      0
like image 98
unutbu Avatar answered Oct 20 '22 07:10

unutbu