Suppose I had a pandas series of dollar values and wanted to discretize into 9 groups using qcut
. The # of observations is not divisible by 9. SQL Server's ntile
function has a standard approach for this case: it makes the first n out of 9 groups 1 observation larger than the remaining (9-n) groups.
I noticed in pandas that the assignment of which groups had x observations vs x + 1 observations seemed random. I tried to decipher the code in algos to figure out how the quantile function deals with this issue but could not figure it out.
I have three related questions:
qcut
's behavior? Is it random which groups get the larger number of observations?qcut
to behave similarly to NTILE
(i.e., first groups get x + 1 observations)?NTILE
? (If this is a complicated endeavor, just an outline to your approach would be helpful.)Here is an example of SQL Server's NTILE
output.
Bin |# Observations
1 26
2 26
3 26
4 26
5 26
6 26
7 26
8 25
9 25
Here is pandas:
Bin |# Observations
1 26
2 26
3 26
4 25 (Why is this 25 vs others?)
5 26
6 26
7 25 (Why is this 25 vs others?)
8 26
9 26
The qcut
behaves like this because it's more accurate. Here is an example:
for the ith level, it starts at quantile (i-1)*10%:
import pandas as pd
import numpy as np
a = np.random.rand(26*10+3)
r = pd.qcut(a, 10)
np.bincount(r.labels)
the output is:
array([27, 26, 26, 26, 27, 26, 26, 26, 26, 27])
If you want NTILE, you can calculate the quantiles yourself:
n = len(a)
ngroup = 10
counts = np.ones(ngroup, int)*(n//ngroup)
counts[:n%ngroup] += 1
q = np.r_[0, np.cumsum(counts / float(n))]
q[-1] = 1.0
r2 = pd.qcut(a, q)
np.bincount(r2.labels)
the output is:
array([27, 27, 27, 26, 26, 26, 26, 26, 26, 26])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With