My question is the same as this previous one:
Binning with zero values in pandas
however, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest are, say, between 1 and 100, how would I put all the 0 values in fractile 1 and the remaining non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), then add 1 to each label (making them 2 to 10) and manually label all the 0 values as fractile 1? Even this is tricky, because in addition to the 600 values, my data set has another couple hundred values which may already be NaN before I convert the 0s.
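For what it's worth, here is a minimal sketch of that exact workaround, assuming the data lives in a pandas Series; the function name zeros_to_first_fractile and the masking details are my own illustration, not a tested answer:

import numpy as np
import pandas as pd

def zeros_to_first_fractile(ser, num_fractiles=10):
    # start with NaN labels so pre-existing NaNs stay NaN
    labels = pd.Series(np.nan, index=ser.index)
    # qcut only the non-zero, non-NaN values into fractiles 2..num_fractiles
    nonzero = ser[(ser != 0) & ser.notna()]
    binned = pd.qcut(nonzero, num_fractiles - 1,
                     labels=list(range(2, num_fractiles + 1)))
    labels.loc[nonzero.index] = binned.astype(float)
    # true zeros become fractile 1 (NaN == 0 is False, so NaNs are untouched)
    labels[ser == 0] = 1
    return labels

Because only the non-zero, non-NaN values are passed to qcut, anything that was already NaN is never relabeled, which sidesteps the complication mentioned above.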
Update 1/26/14:
I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at an edge of the distribution, it inserts an extra bin into the middle of the existing set of bins and throws everything a little (or a lot) off.
import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    # number of non-null observations and the frequency of each distinct value
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    # peel off any value frequent enough to fill a whole fractile by itself
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    # qcut the remaining values into the remaining fractiles
    curr_ser = ser[~ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    # splice the high-frequency values back in as extra bin edges
    for val in high_freq:
        bisect.insort(all_bins, val)
    # include_lowest so values equal to the lowest edge (e.g. 0) land in the first bin
    cut = pd.cut(ser, bins=all_bins, include_lowest=True)
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles
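For example, on a distribution like the one the question describes (this test data is my own, not from the original post):

import numpy as np
ser = pd.Series([0] * 300 + list(np.random.uniform(1, 100, 300)))
print(fractile_cut(ser, 10).value_counts().sort_index())
# the 300 zeros should all land in fractile 1, with fractiles 2-10
# splitting the non-zero values roughly evenly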
The major distinction is that qcut calculates the bin edges so that the data are distributed equally across the bins. In other words, all bins will have (roughly) the same number of observations, but the bin ranges will vary. cut, on the other hand, is used to define the bin edges explicitly.
In qcut, when we specify q=5, we are telling pandas to cut the Year column into 5 equal quantiles, i.e. 0-20%, 20-40%, 40-60%, 60-80% and 80-100% buckets/bins.
Quantile-based discretization function. Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
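To make the distinction concrete, here is a quick comparison of the two on the same small series (the data is my own illustration):

import pandas as pd

years = pd.Series([1990, 1992, 1995, 2000, 2001, 2005, 2010, 2015, 2018, 2020])

# qcut: equal-count bins, so each quintile holds exactly 2 of the 10 values
print(pd.qcut(years, 5).value_counts().sort_index())

# cut: equal-width bins (6 years each here), so the counts per bin vary
print(pd.cut(years, 5).value_counts().sort_index())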
The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).
The solutions are:
1 - Use pandas >= 0.20.0, which has a fix for this: a duplicates='raise'|'drop' option to control whether to raise on duplicated edges or to drop them, which results in fewer bins than specified, some of them larger (with more elements) than others (see the sketch after this list).
2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.
3 - Rank your data with DataFrame.rank(method='first'). Ranking assigns a unique value (the rank) to each element while keeping the order of the elements; identical values are ranked in the order they appear in the array (see method='first').
Example:

pd.qcut(df, nbins)   # raises "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)
4 - Specify a custom range of quantiles, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile.
5 - Use pandas.cut, which chooses bins that are evenly spaced according to the values themselves, whereas pandas.qcut chooses the bins so that each one has the same number of records.
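As a rough sketch of how options 1, 2, 4 and 5 look in practice (this is the sketch referenced in option 1; the series a is my own example, similar to the one used in the jitter answer below):

import pandas as pd

a = pd.Series(list(range(100)) + [0] * 20)

# option 1 (pandas >= 0.20.0): drop the duplicated edges, yielding fewer than 10 bins
pd.qcut(a, 10, duplicates='drop')

# option 2: fewer quantiles, so the run of zeros no longer spans more than one edge
pd.qcut(a, 5)

# option 4: custom quantile range, trading equal bin sizes for unique edges
pd.qcut(a, [0, .50, .75, 1.])

# option 5: evenly spaced edges over the value range instead of equal counts
pd.cut(a, 10)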
Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:
import numpy as np
import pandas as pd

a = pd.Series(list(range(100)) + [0] * 20)

def jitter(a_series, noise_reduction=1000000):
    # uniform noise scaled far below the data's spread, centered on zero
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) \
        - (a_series.std() / (2 * noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)
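Because the jitter breaks all of the ties, each decile should now end up with exactly 12 of the 120 values:

a_deciles.value_counts().sort_index()
# 0    12
# 1    12
# ...
# 9    12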
We can recreate the original error with something like this:
a_deciles = pd.qcut(a, 10, labels=False)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
    precision=precision, include_lowest=True)
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
    raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([  0.        ,   0.        ,   0.        ,   3.8       ,
        11.73333333,  19.66666667,  27.6       ,  35.53333333,
        43.46666667,  51.4       ,  59.33333333,  67.26666667,
        75.2       ,  83.13333333,  91.06666667,  99.        ])