pandas qcut not putting equal number of observations into each bin

I have a data frame, from which I can select a column (series) as follows:

df:

            value_rank
275488          90
275490          35
275491          60
275492          23
275493          23
275494          34
275495          75
275496          40
275497          69
275498          14
275499          83
...             ...

value_rank is a previously created percentile rank from a larger data set. What I am trying to do is to bin this data, e.g. into quintiles:

pd.qcut(df.value_rank, 5, labels=False)


275488    4
275490    1
275491    3
275492    1
275493    1
275494    1
275495    3
275496    2
...      ...

This appears fine, as expected, but it isn't.

In fact, I have 1569 rows. The nearest number divisible by 5 bins is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations and one with 313 observations. Instead, I get this:

obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()

0    329
3    314
1    313
4    311
2    302

I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!

Here is a small example:

df:

            value_rank
286742               11
286835               53
286865               40
286930               31
286936               45
286955               27
287031               30
287111               36
287269               30
287310               18

pd.qcut gives this:

pd.qcut(df.value_rank, 5, labels=False).value_counts()
bin  count
1    3
4    2
3    2
0    2
2    1

There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!

asked Feb 03 '16 by Carl

2 Answers

Just try the code below:

pd.qcut(df.value_rank.rank(method='first'), nbins)  # nbins = 5 for quintiles
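
Here is a minimal sketch of that idea applied to the small example from the question (the data values are copied from there; the ranking step and bin count are the only assumptions). rank(method='first') breaks ties by order of appearance, so every value gets a distinct rank and qcut can then place an equal number of observations in each bin:

import pandas as pd

# the ten value_rank observations from the small example above
s = pd.Series([11, 53, 40, 31, 45, 27, 30, 36, 30, 18], name='value_rank')

# break the tie between the two 30s by order of appearance,
# then cut the now-distinct ranks into quintiles
bins = pd.qcut(s.rank(method='first'), 5, labels=False)

print(bins.value_counts().sort_index())
# 0    2
# 1    2
# 2    2
# 3    2
# 4    2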
answered Oct 10 '22 by AnksG

qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:

In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')

In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
 array([ 11. ,  25.2,  30. ,  33. ,  41. ,  53. ]))

You can see that there was no choice but to set a bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Is there any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?
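
A quick check along the same lines (a small sketch reusing the test series above; the print calls are only illustrative) shows that the duplicated 30 is the culprit: identical values can never land in different bins, so one bin has to absorb both and its neighbour comes up short.

import pandas as pd

test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')

# the value 30 occurs twice
print(test_series.value_counts().loc[30])  # 2

print(pd.qcut(test_series, 5, labels=False).value_counts().sort_index())
# 0    2
# 1    3   <- both 30s end up here
# 2    1   <- so this bin is left with a single observation
# 3    2
# 4    2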

answered Oct 10 '22 by Robert Rodkey