Groupby given percentiles of the values of the chosen DataFrame column

Tags: python, pandas, group-by

Imagine that I have a DataFrame with columns that contain only real values.

>> df
          col0   col1      col2
0     0.907609     82  4.207991
1     3.743659   1523  6.488842
2     2.358696    324  5.092592
3     0.006793      0  0.000000
4    19.319746  11969  7.405685

I want to group it by quartiles (or any other percentiles that I specify) of a chosen column (e.g., col0) and then perform some operations on these groups. Ideally, I would like to do something like:

df.groupby( quartiles_of_col0 ).mean()  # not working, how to code quartiles_of_col0?

The output should give the mean of each column for the four groups corresponding to the quartiles of col0. Is this possible with the groupby command? What's the simplest way of achieving it?

640

asked Jul 09 '14 15:07

pms

1 Answer

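I think you can do it with pd.cut, using np.percentile to compute the bin edges: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean(). The percentiles [0, 25, 75, 90, 100] are just an example of arbitrary cut points; for quartiles, pass [0, 25, 50, 75, 100] instead.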
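For anyone who wants to run this directly, here is a minimal self-contained sketch of the same idea. It rebuilds the five sample rows from the question, uses the column names that appear in this answer (col0, col1, col2), and switches to quartile edges. The names edges, bins and quartile_means are only illustrative, and observed=False assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; column names follow the answer.
df = pd.DataFrame({
    "col0": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col1": [82, 1523, 324, 0, 11969],
    "col2": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# Quartile bin edges of col0: minimum, 25th, 50th, 75th percentile, maximum.
edges = np.percentile(df["col0"], [0, 25, 50, 75, 100])

# include_lowest=True closes the first bin on the left, so the row holding
# the minimum value is assigned to a bin instead of becoming NaN.
bins = pd.cut(df["col0"], edges, include_lowest=True)

# observed=False keeps bins that happen to be empty (they appear as NaN rows).
quartile_means = df.groupby(bins, observed=False).mean()
print(quartile_means)
```

Swapping the percentile list for any other increasing list that starts at 0 and ends at 100 gives the corresponding custom groups.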

Some explanations:

In [42]:
# Assumes: import numpy as np; import pandas as pd
# Use np.percentile to get the bin edges for any percentiles you want
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
 0.907609,
 3.7436589999999996,
 13.089311200000001,
 19.319745999999999]
In [43]:
# include_lowest=True closes the first bin on the left, so the minimum value is kept
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean())
                       col0     col1      col2
col0                                          
[0.00679, 0.908]   0.457201     41.0  2.103996
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
In [44]:
# Without include_lowest=True the first bin is open on the left, so the row holding the smallest value (0.006793) is skipped
print(df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean())
                       col0     col1      col2
col0                                          
(0.00679, 0.908]   0.907609     82.0  4.207991
(0.908, 3.744]     3.051177    923.5  5.790717
(3.744, 13.0893]        NaN      NaN       NaN
(13.0893, 19.32]  19.319746  11969.0  7.405685
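As an alternative to the pd.cut recipe above (not what this answer uses), pandas also provides pd.qcut, which computes the quantile edges itself and always keeps the minimum value, so there is no include_lowest pitfall. The sketch below is a hedged illustration on the question's sample rows; the labels Q1 to Q4 and the variable names are made up for the example, and note that qcut takes fractions in [0, 1] rather than percentages.

```python
import pandas as pd

# Sample rows copied from the question; column names follow the answer.
df = pd.DataFrame({
    "col0": [0.907609, 3.743659, 2.358696, 0.006793, 19.319746],
    "col1": [82, 1523, 324, 0, 11969],
    "col2": [4.207991, 6.488842, 5.092592, 0.000000, 7.405685],
})

# q=4 asks for quartiles; the labels are purely illustrative names.
quartile = pd.qcut(df["col0"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
means = df.groupby(quartile, observed=False).mean()
print(means)

# A list of fractions gives custom percentile groups, here the same
# 0/25/75/90/100 split used with np.percentile above.
custom = pd.qcut(df["col0"], q=[0, 0.25, 0.75, 0.9, 1])
custom_means = df.groupby(custom, observed=False).mean()
print(custom_means)
```

With only five rows, no value falls between the 75th and 90th percentiles, so that group stays empty and shows up as a row of NaN, just as in the output above.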

105

answered Nov 21 '22 10:11

CT Zhu
