Here is an example DataFrame:
import pandas as pd

df = pd.DataFrame([[1, 1, 10, 11, 12],
                   [1, 1, 13, 14, 15],
                   [1, 2, 16, 17, 18],
                   [1, 2, 19, 20, 21],
                   [1, 3, 22, 23, 24],
                   [1, 3, 25, 26, 27],
                   [1, 4, 28, 29, 30],
                   [1, 4, 31, 32, 33],
                   [1, 4, 34, 35, 36],
                   [1, 4, 37, 38, 39],
                   [1, 4, 40, 41, 42]])
df.columns = ['c1', 'c2', 'p1', 'p2', 'p3']
print(df)
Gives:
c1 c2 p1 p2 p3
0 1 1 10 11 12
1 1 1 13 14 15
2 1 2 16 17 18
3 1 2 19 20 21
4 1 3 22 23 24
5 1 3 25 26 27
6 1 4 28 29 30
7 1 4 31 32 33
8 1 4 34 35 36
9 1 4 37 38 39
10 1 4 40 41 42
What I have done so far:
example = df.groupby(['c1', 'c2'])[['p1', 'p2', 'p3']].quantile([0.25, 0.50, 0.75]).unstack().reset_index()
print(example)
Gives:
   c1  c2     p1                   p2                   p3
            0.25    0.5   0.75   0.25    0.5   0.75   0.25    0.5   0.75
0   1   1  10.75   11.5  12.25  11.75   12.5  13.25  12.75   13.5  14.25
1   1   2  16.75   17.5  18.25  17.75   18.5  19.25  18.75   19.5  20.25
2   1   3  22.75   23.5  24.25  23.75   24.5  25.25  24.75   25.5  26.25
3   1   4  31.00   34.0  37.00  32.00   35.0  38.00  33.00   36.0  39.00
The output above gives the correct percentiles, but I also want the average (mean). In addition, the output above is in wide format, and I would like it in long format.
So, in the wide format, I would want another column called average:
   c1  c2     p1                            p2                            p3
            0.25    0.5   0.75  average   0.25    0.5   0.75  average   0.25    0.5   0.75  average
0   1   1  10.75   11.5  12.25        X  11.75   12.5  13.25        X  12.75   13.5  14.25        X
1   1   2  16.75   17.5  18.25        X  17.75   18.5  19.25        X  18.75   19.5  20.25        X
2   1   3  22.75   23.5  24.25        X  23.75   24.5  25.25        X  24.75   25.5  26.25        X
3   1   4  31.00   34.0  37.00        X  32.00   35.0  38.00        X  33.00   36.0  39.00        X
The final output that I'm looking for is the above table in long format, like so:
c1  c2   0.25  0.50   0.75  average  p
 1   1  10.75  11.5  12.25        X  1
 1   1  11.75  12.5  13.25        X  2
 1   1  12.75  13.5  14.25        X  3
 1   2  16.75  17.5  18.25        X  1
 1   2  17.75  18.5  19.25        X  2
 1   2  18.75  19.5  20.25        X  3
I'm having two problems: I don't know how and where to include the calculation of the mean along with the 25th, 50th and 75th percentiles, and I don't know how to convert the result to long format.
Using describe:
df.groupby(['c1', 'c2']).describe().stack(level=0)[['25%', '50%', '75%', 'mean']]
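As a minimal, self-contained sketch of the describe approach above (using only the first four rows of the example data, and passing the percentiles explicitly rather than relying on the defaults), the steps could look like this. The names summary and long_summary are illustrative only, and newer pandas releases may emit a warning about the stack implementation.

```python
import pandas as pd

# A shortened copy of the example data (first two c2 groups only).
df = pd.DataFrame(
    [[1, 1, 10, 11, 12],
     [1, 1, 13, 14, 15],
     [1, 2, 16, 17, 18],
     [1, 2, 19, 20, 21]],
    columns=["c1", "c2", "p1", "p2", "p3"],
)

# describe() returns count, mean, std, min, the requested percentiles and max
# for every value column, with a two-level column index (column, statistic).
summary = df.groupby(["c1", "c2"]).describe(percentiles=[0.25, 0.5, 0.75])

# Move the p1/p2/p3 column level into the row index (long layout),
# then keep only the statistics of interest.
long_summary = summary.stack(level=0)[["25%", "50%", "75%", "mean"]]
print(long_summary)
```

Because describe computes every summary statistic and then discards most of them, this is convenient rather than minimal; the wrapper-function approach computes only what is needed.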
Define wrapper functions for quantile, then pass in a list of calculations (including mean):
def q1(x):
    return x.quantile(0.25)

def q2(x):
    return x.median()

def q3(x):
    return x.quantile(0.75)

df.groupby(['c1', 'c2']).agg(['mean', q1, q2, q3]).stack(level=0)
          mean     q1    q2     q3
c1 c2
1  1  p1  11.5  10.75  11.5  12.25
      p2  12.5  11.75  12.5  13.25
      p3  13.5  12.75  13.5  14.25
   2  p1  17.5  16.75  17.5  18.25
      p2  18.5  17.75  18.5  19.25
      p3  19.5  18.75  19.5  20.25
   3  p1  23.5  22.75  23.5  24.25
      p2  24.5  23.75  24.5  25.25
      p3  25.5  24.75  25.5  26.25
   4  p1  34.0  31.00  34.0  37.00
      p2  35.0  32.00  35.0  38.00
      p3  36.0  33.00  36.0  39.00
To get your exact desired output (no MultiIndex and column renamed to p), add this to the end of the method chain:
.reset_index().rename(columns={"level_2":"p"})
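Putting the pieces together, the full chain might be sketched as follows, again on a shortened copy of the example data. The final step, which turns the labels p1/p2/p3 into the integers 1/2/3 shown in the question's desired output, is an optional addition and not part of the chain above; the name result is illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 10, 11, 12],
     [1, 1, 13, 14, 15],
     [1, 2, 16, 17, 18],
     [1, 2, 19, 20, 21]],
    columns=["c1", "c2", "p1", "p2", "p3"],
)

# Wrapper functions: agg() uses each function's name as the column label.
def q1(x):
    return x.quantile(0.25)

def q2(x):
    return x.median()

def q3(x):
    return x.quantile(0.75)

result = (
    df.groupby(["c1", "c2"])
      .agg(["mean", q1, q2, q3])         # wide: (p1..p3) x (mean, q1, q2, q3)
      .stack(level=0)                    # long: p1..p3 become an index level
      .reset_index()                     # flatten the MultiIndex into columns
      .rename(columns={"level_2": "p"})  # the unnamed level is called level_2
)

# Optional (assumption about the desired format): keep only the number in p.
result["p"] = result["p"].str.replace("p", "", regex=False).astype(int)
print(result)
```

Each row of result then holds one (c1, c2, p) combination together with its mean and three quartiles.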
Note: This answer is largely inspired by Wen's answer here.