Imagine that I have a DataFrame with columns that contain only real values.
>>> df
        col1   col2      col3
0   0.907609     82  4.207991
1   3.743659   1523  6.488842
2   2.358696    324  5.092592
3   0.006793      0  0.000000
4  19.319746  11969  7.405685
I want to group it by quartiles (or any other percentiles that I specify) of a chosen column (e.g., col1) in order to perform some operations on these groups. Ideally, I would like to do something like:
df.groupby( quartiles_of_col1 ).mean() # not working, how to code quartiles_of_col1?
The output should give the mean of each of the columns for the four groups corresponding to the quartiles of col1. Is this possible with the groupby command? What's the simplest way of achieving it?
Percentiles are given as percent values, such as 95%, 40%, or 27%. Quantiles are given as decimal fractions, such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.
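A quick way to check that equivalence is to compute both on the same data. This is only a small sketch: it reuses the col1 values from the question, and the variable names are made up for illustration.

```python
import numpy as np

# col1 values from the question's DataFrame
values = np.array([0.907609, 3.743659, 2.358696, 0.006793, 19.319746])

# np.quantile takes a fraction in [0, 1]; np.percentile takes a percent in [0, 100]
q95 = np.quantile(values, 0.95)
p95 = np.percentile(values, 95)

print(q95, p95)  # the two numbers should agree
```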
The describe() method returns a description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains the following information for each column: count - The number of non-empty values. mean - The average (mean) value.
df.groupby('Category').field_A.quantile(0.1) returns the 10th percentile of field_A for each group of Category.
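A minimal sketch of that per-group quantile call, assuming a small made-up DataFrame (the data are hypothetical; only the column names Category and field_A come from the snippet above):

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "Category": ["a", "a", "a", "b", "b", "b"],
    "field_A": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# 10th percentile (0.1 quantile) of field_A within each Category,
# using pandas' default linear interpolation
p10 = df.groupby("Category")["field_A"].quantile(0.1)
print(p10)
```

Note that this gives one number per group, which is a different task from the question above: there the groups themselves are defined by the quantiles of a column.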
I don't have a computer to test it right now, but I think you can do it with:
df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
Will update after 150 mins.
Some explanations:
In [42]:
# use np.percentile to get the bin edges for any percentiles you want
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
# include_lowest=True is needed so that the first bin includes its left edge (the minimum value)
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
# Otherwise the smallest value is skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
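For plain quartiles, pd.qcut might be a shorter route than computing the bin edges by hand, since it bins directly by sample quantiles and includes the minimum in the first bin. The sketch below is not part of the answer above. It rebuilds the question's DataFrame with the question's column names (col1, col2, col3), and the labels Q1 to Q4 are made up for readability.

```python
import pandas as pd

# The DataFrame from the question
df = pd.DataFrame({
    "col1": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col2": [82, 1523, 324, 0, 11969],
    "col3": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# q=4 splits col1 at its 25th, 50th and 75th percentiles;
# the labels are illustrative names for the four bins
quartiles_of_col1 = pd.qcut(df["col1"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# observed=False keeps empty bins (if any) in the result as NaN rows
quartile_means = df.groupby(quartiles_of_col1, observed=False).mean()
print(quartile_means)
```

For other percentiles, qcut also accepts a list of quantile fractions, e.g. q=[0, 0.25, 0.75, 0.9, 1], which should correspond to the np.percentile edges used above.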