Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas qcut behavior with # of observations not divisible by # of bins

Suppose I had a pandas series of dollar values and wanted to discretize into 9 groups using qcut. The # of observations is not divisible by 9. SQL Server's ntile function has a standard approach for this case: it makes the first n out of 9 groups 1 observation larger than the remaining (9-n) groups.

I noticed in pandas that the assignment of which groups had x observations vs x + 1 observations seemed random. I tried to decipher the code in algos to figure out how the quantile function deals with this issue but could not figure it out.

I have three related questions:

  1. Any pandas developers out there than can explain qcut's behavior? Is it random which groups get the larger number of observations?
  2. Is there a way to force qcut to behave similarly to NTILE (i.e., first groups get x + 1 observations)?
  3. If the answer to #2 is no, any ideas on a function that would behave like NTILE? (If this is a complicated endeavor, just an outline to your approach would be helpful.)

Here is an example of SQL Server's NTILE output.

Bin |# Observations
1   26
2   26
3   26
4   26
5   26
6   26
7   26
8   25
9   25

Here is pandas:

Bin |# Observations
1   26
2   26
3   26
4   25 (Why is this 25 vs others?)
5   26
6   26
7   25 (Why is this 25 vs others?)
8   26
9   26
like image 694
AllenQ Avatar asked Dec 22 '13 05:12

AllenQ


1 Answers

The qcut behaves like this because it's more accurate. Here is an example:

for the ith level, it starts at quantile (i-1)*10%:

import pandas as pd
import numpy as np

a = np.random.rand(26*10+3)
r = pd.qcut(a, 10)
np.bincount(r.labels)

the output is:

array([27, 26, 26, 26, 27, 26, 26, 26, 26, 27])

If you want NTILE, you can calculate the quantiles yourself:

n = len(a)
ngroup = 10

counts = np.ones(ngroup, int)*(n//ngroup)
counts[:n%ngroup] += 1

q = np.r_[0, np.cumsum(counts / float(n))]
q[-1] = 1.0
r2 = pd.qcut(a, q)
np.bincount(r2.labels)

the output is:

array([27, 27, 27, 26, 26, 26, 26, 26, 26, 26])
like image 103
HYRY Avatar answered Oct 02 '22 12:10

HYRY