Why does Pandas qcut give me unequal sized bins?

Question

Pandas docs have this to say about the qcut function:

Discretize variable into equal-sized buckets based on rank or based on sample quantiles.

So I would expect this code to give me 4 bins of 10 values each:

import numpy as np
import pandas as pd

np.random.seed(4242)

y = pd.Series(np.random.randint(low=1, high=10, size=40))
quartiles = pd.qcut(y, 4, labels=['1st', '2nd', '3rd', '4th'])

print('Quartiles:')
print(quartiles.value_counts(sort=False))

y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');

But instead I get this:

Quartiles:
1st    14
2nd     6
3rd    11
4th     9
dtype: int64

[Bar chart: count and mean of y for each quartile]

What am I doing wrong here?

SQLGIT_GeekInTraining · Accepted Answer

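This happens because qcut places its bin edges at the sample quantiles of the values, and every copy of a repeated value has to fall on the same side of an edge. Your series has 40 entries but at most 9 distinct values (the integers 1 through 9), so the edges land inside runs of tied values and the bins cannot all hold exactly 10 records. A simple adjustment to your code produces the desired result: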

import numpy as np
import pandas as pd

np.random.seed(4242)

y = pd.Series(np.random.randint(low=1, high=10, size=40))
# Rank first so that tied values get distinct ranks, then cut the ranks.
quartiles = pd.qcut(y.rank(method='first'), 4, labels=['1st', '2nd', '3rd', '4th'])

print('Quartiles:')
print(quartiles.value_counts(sort=False))

y.groupby(quartiles).agg(['count', 'mean']).plot(kind='bar');

Calling rank(method='first') gives every record a unique rank from 1 to 40, breaking ties by order of appearance in the series. Because the ranks contain no duplicates, qcut can split them into four bins of exactly 10. Keep in mind that which of the tied records ends up on either side of a bin edge then depends only on their position in the series.
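To make the tie-breaking visible, here is a minimal sketch on a small made-up series (the values in `s` and the names `avg_ranks`, `first_ranks` and `halves` are my own illustration, not part of the question). It compares the default ranking with method='first' and then cuts the unique ranks into two halves:

```python
import pandas as pd

# Hypothetical toy data with repeated values (not the data from the question).
s = pd.Series([3, 1, 3, 2, 3, 1, 2, 3])

# The default method ('average') gives tied values the same rank.
avg_ranks = s.rank()

# method='first' breaks ties by order of appearance, so every rank is unique.
first_ranks = s.rank(method='first')

print(pd.DataFrame({'value': s, 'average': avg_ranks, 'first': first_ranks}))

# With unique ranks, qcut can place an edge between any two records.
halves = pd.qcut(first_ranks, 2, labels=['low', 'high'])
print(halves.value_counts(sort=False))
```

The 'first' column should contain each rank from 1 to 8 exactly once, which is what lets qcut split the records evenly.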

With this change, the output I get is as follows:

Quartiles:
1st    10
2nd    10
3rd    10
4th    10
dtype: int64
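If you want to see why the original call produced uneven bins, you can ask qcut for the edges it computed by passing retbins=True. The sketch below uses a small made-up series rather than your random data (the values in `y` and the helper name `bins_per_value` are assumptions for illustration only), so the effect of repeated values is easy to follow:

```python
import pandas as pd

# Hypothetical toy data: 12 records but only 5 distinct values.
y = pd.Series([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5])

# retbins=True also returns the bin edges that qcut derived from the quantiles.
quartiles, edges = pd.qcut(
    y, 4, labels=['1st', '2nd', '3rd', '4th'], retbins=True
)
print('Edges:', edges)

# Bin sizes follow the runs of repeated values, not the ideal 3 per bin.
counts = quartiles.value_counts(sort=False)
print(counts)

# Every copy of a given value ends up in exactly one bin.
bins_per_value = quartiles.astype(str).groupby(y).nunique()
print(bins_per_value)
```

Here the edges fall on the repeated values themselves, so the four copies of 2 all join the first bin and the counts come out uneven even though 12 divides cleanly by 4.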

Tags:

python

pandas

skagr

1 Answers

SQLGIT_GeekInTraining
