I have an unsorted array N. I'd like to keep the original order of its elements, but replace each element with its bin number, where the array is split into m bins holding an equal number of values (if the array length is divisible by m) or a nearly equal number (if it is not). I need a vectorized solution, since the array is fairly large and plain Python loops won't be efficient. Is there anything in SciPy or NumPy that can do this?
e.g.
N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2
Desired output: [0, 1, 0, 1, 0]
I've looked at numpy.histogram, but it doesn't give me unequally spaced bins.
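Listed in this post is a NumPy-based vectorized approach. The idea is to create equally spaced bin boundaries over the length of the input array, use np.searchsorted to assign each sorted position to a bin, and then map those bins back to the original order of the elements.
Here's the implementation -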
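import numpy as np

def equal_bin(N, m):
    sep = (N.size/float(m))*np.arange(1,m+1)  # right edge of each bin
    # side='right' keeps the bins equal when N.size is divisible by m
    idx = sep.searchsorted(np.arange(N.size), side='right')
    return idx[N.argsort().argsort()]  # ranks map bins to original order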
Sample runs with bin-counting for each bin to verify results -
In [442]: N = np.arange(1,94)
In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])
In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])
In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])
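The last line of equal_bin relies on a double argsort, which is easy to misread. Here is a minimal sketch of what it computes, using the small array from the question. The variable names are illustrative only and the per-position bin labels are written out by hand.

```python
import numpy as np

N = np.array([0.2, 1.5, 0.3, 1.7, 0.5])

# First argsort: the indices that would sort N
order = N.argsort()
print(order)   # [0 2 4 1 3]

# Second argsort: the rank of each element, in the original order
ranks = order.argsort()
print(ranks)   # [0 3 1 4 2]

# Bin label for each sorted position (5 positions split into 2 bins)
bins_by_position = np.array([0, 0, 0, 1, 1])

# Indexing by rank gives every original element the bin of its position
labels = bins_by_position[ranks]
print(labels)  # [0 1 0 1 0]
```

So the rank of an element says which sorted position it occupies, and indexing the per-position bins by rank restores the original order.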
Here's another approach using np.linspace to create those equally spaced numbers that could be used as indices, like so -
def equal_bin_v2(N, m):
    # num must be an integer; endpoint=False keeps every value below m
    idx = np.linspace(0, m, N.size, endpoint=False).astype(int)
    return idx[N.argsort().argsort()]
Sample run -
In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])
In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])
In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])
In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])
In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])
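Both functions above go through floating-point bin boundaries. As a hedged alternative that is not part of the original answer, the same kind of labels can be sketched with integer arithmetic only: an element of rank r goes to bin r*m // N.size. The name equal_bin_int is hypothetical.

```python
import numpy as np

def equal_bin_int(N, m):
    # Hypothetical helper: rank of each element, in the original order
    ranks = N.argsort().argsort()
    # Floor division spreads ranks 0..N.size-1 over bins 0..m-1
    return ranks * m // N.size

N = np.array([0.2, 1.5, 0.3, 1.7, 0.5])
print(equal_bin_int(N, 2))  # [0 1 0 1 0]
print(equal_bin_int(N, 3))  # [0 1 0 2 1]

# Bin counts for 93 elements in 4 bins
print(np.bincount(equal_bin_int(np.arange(1, 94), 4)))  # [24 23 23 23]
```

On these inputs it reproduces the sample outputs shown above, and it avoids any rounding at the bin boundaries.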
Another good alternative is pd.qcut from pandas. For example:
In [6]: import pandas as pd
In [7]: N = [0.2, 1.5, 0.3, 1.7, 0.5]
...: m = 2
In [8]: pd.qcut(N, m, labels=False)
Out[8]: array([0, 1, 0, 1, 0], dtype=int64)
If you want the bin intervals instead of integer codes, leave labels at its default (None). This will allow you to get the bin midpoints with:
In [26]: intervals = pd.qcut(N, 2)
In [27]: [i.mid for i in intervals]
Out[27]: [0.34950000000000003, 1.1, 0.34950000000000003, 1.1, 0.34950000000000003]
Here intervals is a pandas Categorical whose elements are pandas.Interval objects (when labels is left at its default).
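If it is the numeric bin edges that are needed rather than Interval objects, pd.qcut also accepts retbins=True, which returns the edges alongside the labels. A small sketch with the same data:

```python
import pandas as pd

N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2

# retbins=True makes qcut return a (labels, edges) pair
codes, edges = pd.qcut(N, m, labels=False, retbins=True)

print(codes)  # integer bin number per element, in the original order
print(edges)  # m + 1 edges; for m = 2: minimum, median, maximum
```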
See also: pd.cut, if you would like to make the bin width (not the bin count) equal.