Pandas Merge Bins

I have created a distribution using numpy histogram and digitize functions.

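import numpy as np

_, bins = np.histogram(x, bins=bins)  # keep only the bin edges
arr = np.digitize(x, bins) - 1        # index of the bin each value falls into
x = bins[arr]                         # replace each value with its bin's left edge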

Or possibly:

x = pandas.cut(x, bins=bins)

However, the distribution is very skewed, so even after removing outliers there are many bins with very few observations. I want to merge bins, somewhat similar to:

How to merge bins in R

The procedure would possibly involve a pandas groupby, followed by merging every group with fewer than n observations into a neighbouring bin. Is there a way to achieve this in pandas/numpy?

hangc asked Dec 16 '25 13:12


1 Answer

As promised, I implemented something in physt, version 0.3.5. You're welcome to use it.

See http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#Merging-bins and particularly http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#By-min-frequency

In your case, the workflow would be something like this:

import physt
histogram = physt.h1(x, bins=bins)
histogram.merge_bins(min_frequency=n)
bins = histogram.numpy_bins 

Note that the code is at an alpha stage, and not every bin is guaranteed to end up with at least the required minimum (this is deliberate, in order to preserve tall narrow bins). I am still looking for a better algorithm.
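If you would rather stay in plain numpy, the general idea of merging by minimum frequency can be sketched as a greedy left-to-right pass over the bin counts. This is only an illustration and not the algorithm physt uses: merge_sparse_bins is a hypothetical helper name, and the assumption here is that adjacent bins are simply pooled until the running count reaches min_count, with any sparse remainder at the right end folded into the last merged bin.

```python
import numpy as np

def merge_sparse_bins(x, edges, min_count):
    """Greedily merge adjacent bins until each holds at least min_count values.

    Hypothetical helper for illustration; not part of numpy, pandas or physt.
    """
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(x, bins=edges)

    new_edges = [edges[0]]
    running = 0
    for count, right_edge in zip(counts, edges[1:]):
        running += count
        if running >= min_count:
            # Enough observations collected: close the merged bin here.
            new_edges.append(right_edge)
            running = 0

    # Leftover bins at the right end never reached min_count:
    # fold them into the previous merged bin (or keep a single bin).
    if running > 0 or new_edges[-1] != edges[-1]:
        if len(new_edges) > 1:
            new_edges[-1] = edges[-1]
        else:
            new_edges.append(edges[-1])
    return np.array(new_edges)


# Small made-up sample: the two middle bins hold one value each.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 2.5, 3.5, 3.6, 3.7])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
merged = merge_sparse_bins(x, edges, min_count=3)
merged_counts, _ = np.histogram(x, bins=merged)
```

The returned edges can be passed on to np.digitize or pandas.cut as in the question. Because the pass is greedy, the result depends on the direction in which the bins are scanned, and a tall narrow bin may get widened by absorbing sparse neighbours, which is exactly the trade-off mentioned above.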

honza_p answered Dec 19 '25 03:12


