
What is the difference between pandas.qcut and pandas.cut?




The documentation says:


"Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions"

Sounds very abstract to me... I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?


    factors = np.random.randn(30)

    In [11]: pd.cut(factors, 5)
    Out[11]:
    [(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
    Length: 30
    Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

    In [14]: pd.qcut(factors, 5)
    Out[14]:
    [(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
    Length: 30
    Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]
WillZ asked May 13 '15 10:05


1 Answer

To begin, note that "quantile" is just the general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.
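To make "asking for quintiles" concrete, here is a minimal sketch that compares the bin edges qcut picks with the sample's 0th, 20th, 40th, 60th, 80th and 100th percentiles. The seed and variable names are only for illustration, and it assumes a reasonably recent NumPy and pandas; the two sets of edges should agree up to floating-point rounding.

```python
import numpy as np
import pandas as pd

# Illustrative data: 30 draws from a standard normal (seed chosen arbitrarily)
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# retbins=True makes qcut also return the bin edges it chose
binned, edges = pd.qcut(factors, 5, retbins=True)

# Quintile boundaries computed directly from the sample
quintile_edges = np.quantile(factors, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

print(edges)
print(quintile_edges)
print(np.allclose(edges, quintile_edges))
```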

So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin (or as close to that as the data allows). You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

    pd.qcut(factors, 5).value_counts()

    [-2.578, -0.829]    6
    (-0.829, -0.36]     6
    (-0.36, 0.366]      6
    (0.366, 0.868]      6
    (0.868, 2.617]      6
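If you want to reproduce this yourself, a small sketch along these lines should do it. The seed is arbitrary, and the second sample of 32 values is my own addition to show what happens when the record count does not divide evenly by the number of bins: the counts can then only be approximately equal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

# 30 distinct values into 5 quantile bins: 30 / 5 = 6 per bin
counts_30 = pd.qcut(rng.standard_normal(30), 5).value_counts().sort_index()

# 32 values cannot be split evenly into 5 bins, so counts differ slightly
counts_32 = pd.qcut(rng.standard_normal(32), 5).value_counts().sort_index()

print(counts_30.tolist())
print(counts_32.tolist())
```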

Conversely, for cut you will see something more uneven:

    pd.cut(factors, 5).value_counts()

    (-2.583, -1.539]    5
    (-1.539, -0.5]      5
    (-0.5, 0.539]       9
    (0.539, 1.578]      9
    (1.578, 2.617]      2

That's because cut chooses bins of equal width across the range of the values themselves, regardless of how many values land in each bin. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and fewer in the outer. This is essentially a tabular form of a histogram (which you would expect to be roughly bell shaped, though with only 30 records it will be noisy).
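The histogram comparison can be checked directly. The sketch below (seed and names are illustrative) prints the widths of the bins cut produces and compares its per-bin counts with np.histogram on the same data. The counts should match for continuous data like this, where no value sits exactly on an interior edge; note that cut pushes the lowest edge slightly below the minimum so that the minimum is included, which makes the first bin a touch wider than the rest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
factors = rng.standard_normal(30)

# Equal-width bins over the range of the data
binned, edges = pd.cut(factors, 5, retbins=True)
cut_counts = binned.value_counts().sort_index().to_numpy()

# The same idea expressed as a plain histogram with 5 equal-width bins
hist_counts, hist_edges = np.histogram(factors, bins=5)

print(np.diff(edges))  # bin widths; the first is marginally larger
print(cut_counts)      # counts per bin from pd.cut
print(hist_counts)     # counts per bin from np.histogram
```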

JohnE answered Sep 20 '22 19:09
