Pandas: Groupby two columns and find 25th, median, 75th percentile AND mean of 3 columns in LONG format [duplicate]

Here is an example DataFrame:

import pandas as pd

df = pd.DataFrame([[1, 1, 10, 11, 12],
                    [1, 1, 13, 14, 15], 
                    [1, 2, 16, 17, 18], 
                    [1, 2, 19, 20, 21],
                    [1, 3, 22, 23, 24], 
                    [1, 3, 25, 26, 27],
                    [1, 4, 28, 29, 30], 
                    [1, 4, 31, 32, 33], 
                    [1, 4, 34, 35, 36],
                    [1, 4, 37, 38, 39],
                    [1, 4, 40, 41, 42]])

df.columns = ['c1', 'c2', 'p1', 'p2', 'p3']
print(df)

Gives:

    c1  c2  p1  p2  p3
0    1   1  10  11  12
1    1   1  13  14  15
2    1   2  16  17  18
3    1   2  19  20  21
4    1   3  22  23  24
5    1   3  25  26  27
6    1   4  28  29  30
7    1   4  31  32  33
8    1   4  34  35  36
9    1   4  37  38  39
10   1   4  40  41  42

What I have done so far:

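# Select several columns with a list (double brackets); tuple-style selection fails in current pandas.
example = df.groupby(['c1', 'c2'])[['p1', 'p2', 'p3']].quantile([0.25, 0.50, 0.75]).unstack().reset_index()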

print(example)

Gives:

  c1 c2     p1                  p2                  p3             
          0.25   0.5   0.75   0.25   0.5   0.75   0.25   0.5   0.75
0  1  1  10.75  11.5  12.25  11.75  12.5  13.25  12.75  13.5  14.25
1  1  2  16.75  17.5  18.25  17.75  18.5  19.25  18.75  19.5  20.25
2  1  3  22.75  23.5  24.25  23.75  24.5  25.25  24.75  25.5  26.25
3  1  4  31.00  34.0  37.00  32.00  35.0  38.00  33.00  36.0  39.00

The output above gives the correct percentiles, but I also want the mean. In addition, the result is in wide format, and I would like it in long format.

In wide format, that would mean one more column called average under each of p1, p2 and p3 (X marks the value still to be computed):

  c1 c2     p1                          p2                              p3             
          0.25   0.5   0.75  average    0.25   0.5   0.75   average     0.25   0.5   0.75   average
0  1  1  10.75  11.5  12.25     X       11.75  12.5  13.25     X        12.75  13.5  14.25    X
1  1  2  16.75  17.5  18.25     X       17.75  18.5  19.25     X        18.75  19.5  20.25    X
2  1  3  22.75  23.5  24.25     X       23.75  24.5  25.25     X        24.75  25.5  26.25    X
3  1  4  31.00  34.0  37.00     X       32.00  35.0  38.00     X        33.00  36.0  39.00    X 

The final OUTPUT that I'm looking for is the above table in long format like so:

    c1      c2      0.25    0.50    0.75    average      p
    1       1       10.75   11.5    12.25      X         1
    1       1       11.75   12.5    13.25      X         2
    1       1       12.75   13.5    14.25      X         3
    1       2       16.75   17.5    18.25      X         1
    1       2       17.75   18.5    19.25      X         2
    1       2       18.75   19.5    20.25      X         3

I have two problems: I don't know how or where to compute the mean alongside the 25th, 50th and 75th percentiles, and I don't know how to convert the result to long format.

imperialgendarme asked Jan 02 '23 02:01


2 Answers

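Use describe, which computes the quartiles and the mean in one call, then stack the column level that holds p1, p2 and p3 into the index: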

df.groupby(['c1', 'c2']).describe().stack(level=0)[['25%', '50%', '75%', 'mean']]
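
For reference, here is a minimal end-to-end sketch of the describe approach on the question's example data. The renaming steps are an assumption about the desired layout (column names 0.25, 0.50, 0.75, average and p, taken from the question), not part of the original one-liner, and long_df is just an illustrative variable name. Recent pandas versions may emit a FutureWarning about stack; the result should be the same.

```python
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame(
    [[1, 1, 10, 11, 12], [1, 1, 13, 14, 15],
     [1, 2, 16, 17, 18], [1, 2, 19, 20, 21],
     [1, 3, 22, 23, 24], [1, 3, 25, 26, 27],
     [1, 4, 28, 29, 30], [1, 4, 31, 32, 33],
     [1, 4, 34, 35, 36], [1, 4, 37, 38, 39],
     [1, 4, 40, 41, 42]],
    columns=['c1', 'c2', 'p1', 'p2', 'p3'],
)

# describe() yields two-level columns: (p1..p3) x (count, mean, std, min, 25%, ...).
stats = df.groupby(['c1', 'c2']).describe()

long_df = (
    stats.stack(level=0)[['25%', '50%', '75%', 'mean']]  # move p1..p3 into the index
         .rename_axis(index=['c1', 'c2', 'p'])           # name the new index level
         .reset_index()                                  # flatten the MultiIndex
         .rename(columns={'25%': '0.25', '50%': '0.50',
                          '75%': '0.75', 'mean': 'average'})
)
print(long_df)
```

The p column holds the labels p1, p2 and p3 rather than the bare numbers shown in the question; if plain integers are wanted, something like long_df['p'].str[1:].astype(int) should convert them.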
chadlagore answered Jan 04 '23 15:01


Define wrapper functions for quantile, then pass in a list of calculations (including mean):

def q1(x):
    return x.quantile(0.25)

def q2(x):
    return x.median()

def q3(x):
    return x.quantile(0.75)

df.groupby(['c1', 'c2']).agg(['mean', q1, q2, q3]).stack(level=0)

          mean     q1    q2     q3
c1 c2                             
1  1  p1  11.5  10.75  11.5  12.25
      p2  12.5  11.75  12.5  13.25
      p3  13.5  12.75  13.5  14.25
   2  p1  17.5  16.75  17.5  18.25
      p2  18.5  17.75  18.5  19.25
      p3  19.5  18.75  19.5  20.25
   3  p1  23.5  22.75  23.5  24.25
      p2  24.5  23.75  24.5  25.25
      p3  25.5  24.75  25.5  26.25
   4  p1  34.0  31.00  34.0  37.00
      p2  35.0  32.00  35.0  38.00
      p3  36.0  33.00  36.0  39.00

To get your exact desired output (no MultiIndex and column renamed to p), add this to the end of the method chain:

.reset_index().rename(columns={"level_2":"p"})
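
Put together, the whole chain might look like the sketch below. It assumes the question's example data and reuses the q1, q2 and q3 wrappers from this answer; result is only an illustrative variable name. The rename relies on reset_index labelling the unnamed third index level as level_2.

```python
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame(
    [[1, 1, 10, 11, 12], [1, 1, 13, 14, 15],
     [1, 2, 16, 17, 18], [1, 2, 19, 20, 21],
     [1, 3, 22, 23, 24], [1, 3, 25, 26, 27],
     [1, 4, 28, 29, 30], [1, 4, 31, 32, 33],
     [1, 4, 34, 35, 36], [1, 4, 37, 38, 39],
     [1, 4, 40, 41, 42]],
    columns=['c1', 'c2', 'p1', 'p2', 'p3'],
)

def q1(x):
    return x.quantile(0.25)  # 25th percentile

def q2(x):
    return x.median()        # 50th percentile

def q3(x):
    return x.quantile(0.75)  # 75th percentile

result = (
    df.groupby(['c1', 'c2'])
      .agg(['mean', q1, q2, q3])         # one column per (p, statistic) pair
      .stack(level=0)                    # move p1..p3 from the columns into the index
      .reset_index()                     # unnamed level becomes the column 'level_2'
      .rename(columns={'level_2': 'p'})
)
print(result)
```

Each row of result is one (c1, c2, p) combination, with mean, q1, q2 and q3 as ordinary columns.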

Note: This answer is largely inspired by Wen's answer here.

andrew_reece answered Jan 04 '23 16:01