
Why does Pandas qcut give me unequal sized bins?

Tags: python, pandas

Pandas docs have this to say about the qcut function:

Discretize variable into equal-sized buckets based on rank or based on sample quantiles.

So I would expect this code to give me 4 bins of 10 values each:

import numpy as np
import pandas as pd

np.random.seed(4242)

y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])

print('Quartiles:')
print(quartiles.value_counts(sort=False))

y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');

But instead I get this:

Quartiles:
1st    14
2nd     6
3rd    11
4th     9
dtype: int64

[Bar chart: count and mean of y for each quartile]

What am I doing wrong here?

asked Jun 19 '17 17:06 by skagr


1 Answer

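This happens because qcut places its bin edges at the sample quantiles of the values, and every record with the same value has to land in the same bin. Your series holds 40 integers between 1 and 9, so it is full of ties; when a quantile edge falls on a heavily repeated value, all of those records go to one side of the edge and the bin counts come out unequal. A simple adjustment to your code will produce the desired result: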

import numpy as np
import pandas as pd

np.random.seed(4242)

y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y.rank(method='first'), 4, labels=['1st', '2nd', '3rd', '4th'])

print('Quartiles:')
print(quartiles.value_counts(sort=False))

y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');

Ranking the values first with rank(method='first') gives every record a distinct rank from 1 to 40, with tied values ranked in the order they appear in the series. qcut then cuts those unique ranks instead of the raw values, so each quartile gets exactly 10 records. The trade-off is that records with the same value can end up in different quartiles.

The output I get is as follows:

Quartiles:
1st    10
2nd    10
3rd    10
4th    10
dtype: int64
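
To see the effect of ties in isolation, here is a minimal sketch on made-up data. The series `s`, the two-bin split, and the 'low'/'high' labels are assumptions chosen for illustration and are not part of the question. It uses `retbins=True` to look at the edges qcut computes, then repeats the cut on ranks.

```python
import pandas as pd

# Hypothetical toy data: the value 2 appears five times out of eight.
s = pd.Series([1, 2, 2, 2, 2, 2, 3, 4])

# qcut on the raw values; retbins=True also returns the computed bin edges.
by_value, edges = pd.qcut(s, 2, labels=['low', 'high'], retbins=True)
value_counts = by_value.value_counts(sort=False)
print(edges)         # the middle edge is the median of s, which is 2
print(value_counts)  # every 2 falls on the same side of that edge

# qcut on ranks; method='first' gives each record a distinct rank,
# breaking ties by order of appearance.
ranks = s.rank(method='first')
by_rank = pd.qcut(ranks, 2, labels=['low', 'high'])
rank_counts = by_rank.value_counts(sort=False)
print(rank_counts)   # the two bins now hold the same number of records

# The trade-off: records sharing the value 2 are split across both bins.
print(by_rank[s == 2].tolist())
```

Cutting the raw values puts six records in 'low' and two in 'high', because all five 2s sit at or below the median edge. Cutting the ranks gives four and four, at the cost of placing some 2s in 'low' and others in 'high'. Whether that split is acceptable depends on what the bins are used for.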
answered Oct 03 '22 22:10 by SQLGIT_GeekInTraining