Consider a histogram calculation of a numpy array that returns percentages:
import numpy as np

# 500 random numbers between 0 and 10,000
values = np.random.uniform(0, 10000, 500)
# Histogram using e.g. 200 buckets; the weights make each value count as 100/N percent
perc, edges = np.histogram(values, bins=200,
                           weights=np.zeros_like(values) + 100. / values.size)
The above returns two arrays:
perc, containing the percentage of values (out of the total) that fall between each pair of consecutive edges, edges[ix] and edges[ix+1].
edges, of length len(perc)+1.
Now, say that I want to filter perc and edges so that I only end up with the percentages and edges for values contained within a new range [m, M].
That is, I want to work with the sub-arrays of perc and edges corresponding to the interval of values within [m, M]. Needless to say, the new array of percentages would still refer to the total count of the input array. We just want to filter perc and edges to end up with the correct sub-arrays.
How can I post-process perc and edges to do so?
The values of m and M can be any number, of course. In the example above, we can assume e.g. m = 0 and M = 200.
m = 0; M = 200
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 37.4789683 , 87.07491593, 136.67086357, 186.2668112 ])
Let's work on a smaller dataset so that it is easier to understand:
np.random.seed(0)
values = np.random.uniform(0, 100, 10)
values.sort()
>>> values
array([ 38.34415188, 42.36547993, 43.75872113, 54.4883183 ,
54.88135039, 60.27633761, 64.58941131, 71.51893664,
89.17730008, 96.36627605])
# Histogram using e.g. 10 buckets
perc, edges = np.histogram(values, bins=10,
weights=np.zeros_like(values) + 100./values.size)
>>> perc
array([ 30., 0., 20., 10., 10., 10., 0., 0., 10., 10.])
>>> edges
array([ 38.34415188, 44.1463643 , 49.94857672, 55.75078913,
61.55300155, 67.35521397, 73.15742638, 78.9596388 ,
84.76185122, 90.56406363, 96.36627605])
m = 0; M = 50
mask = (m <= edges) & (edges < M)
>>> mask
array([ True, True, True, False, False, False, False, False, False,
False, False], dtype=bool)
>>> edges[mask]
array([ 38.34415188, 44.1463643 , 49.94857672])
>>> perc[mask[:-1]][:-1]
array([ 30., 0.])
m = 40; M = 60
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 44.1463643 , 49.94857672, 55.75078913])
>>> perc[mask[:-1]][:-1]
array([ 0., 20.])
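One thing to watch with perc[mask[:-1]][:-1]: when the mask also selects the last edge, the trailing [:-1] drops one bin too many. A minimal sketch of a variant that builds one boolean per bin instead of per edge is shown here. The helper name filter_histogram is made up for illustration, and it assumes you only want bins that lie entirely inside [m, M].

```python
import numpy as np

def filter_histogram(perc, edges, m, M):
    """Keep only the bins that lie entirely inside [m, M]."""
    # One boolean per bin: left edge >= m and right edge <= M.
    keep = (edges[:-1] >= m) & (edges[1:] <= M)
    idx = np.flatnonzero(keep)
    if idx.size == 0:
        # No bin fits completely inside the range.
        return perc[:0], edges[:0]
    # Kept bins are contiguous because edges are sorted, so one slice is enough.
    return perc[idx[0]:idx[-1] + 1], edges[idx[0]:idx[-1] + 2]

# Same data as the small example above.
np.random.seed(0)
values = np.sort(np.random.uniform(0, 100, 10))
perc, edges = np.histogram(values, bins=10,
                           weights=np.zeros_like(values) + 100. / values.size)

sub_perc, sub_edges = filter_histogram(perc, edges, 40, 60)
all_perc, all_edges = filter_histogram(perc, edges, 0, 1000)
```

With m = 40 and M = 60 this selects the same two bins as the mask above, and with a range that covers the whole histogram it returns all 10 bins and all 11 edges.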
Well, you might need some arithmetic for this. The bins are equally spaced, so you can work out which bin is the first to include and which is the last from the width of each bin:
bin_width = edges[1] - edges[0]
Now compute the first and last valid bin:
import math

first = math.floor((m - edges[0]) / bin_width) + 1  # How many bins to skip from the left
last = math.floor((edges[-1] - M) / bin_width) + 1  # How many bins to skip from the right
(Ignore the +1 for both if you want to include the bin containing m or M, but then be careful that you don't end up with negative values for first and last!)
Now you know how many bins to include:
valid_edges = edges[first:-last]
valid_perc = perc[first:-last]
This will exclude the first points (as many as the value of first) and the last points (as many as the value of last).
It might be that I haven't paid enough attention to rounding and there is an "off by one" error in there, but I think the idea is sound. :-)
You probably need to catch special cases like M > edges[-1], but for readability I haven't included these.
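To check the arithmetic, here is the calculation run on the 10-bucket example from the other answer, with m = 40 and M = 60. This is only a sketch: the edges and percentages are copied from that output as plain lists, and it assumes m and M fall strictly inside the histogram range and not exactly on an edge.

```python
import math

# Edges and percentages of the 10-bucket example (equally spaced bins).
edges = [38.34415188, 44.1463643, 49.94857672, 55.75078913,
         61.55300155, 67.35521397, 73.15742638, 78.9596388,
         84.76185122, 90.56406363, 96.36627605]
perc = [30., 0., 20., 10., 10., 10., 0., 0., 10., 10.]
m, M = 40, 60

bin_width = edges[1] - edges[0]
first = math.floor((m - edges[0]) / bin_width) + 1  # entries dropped on the left
last = math.floor((edges[-1] - M) / bin_width) + 1  # entries dropped on the right

valid_edges = edges[first:-last]
valid_perc = perc[first:-last]
```

Here first is 1 and last is 7, so the slices keep edges[1:4] and perc[1:3], which matches the edges and percentages the mask approach gives for the same range.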
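Or, if the bins are not equally spaced, use boolean masks instead of the calculation. No +1 is needed here: for m and M inside the histogram range, counting the edges already gives the same values as the formulas above.
first = edges[edges < m].size  # number of edges below m
last = edges[edges > M].size  # number of edges above M (if this is 0, slice with None instead of -last)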
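Since edges is sorted, the same positions can also be found with np.searchsorted, which avoids the negative stop index and so copes with M beyond the last edge. A rough sketch, using made-up unequal bins purely for illustration:

```python
import numpy as np

# Hypothetical histogram with unequal bin widths (illustrative values only).
edges = np.array([0.0, 1.0, 5.0, 20.0, 50.0, 100.0])
perc = np.array([10.0, 20.0, 30.0, 25.0, 15.0])
m, M = 1.0, 60.0

# Index of the first edge >= m.
start = np.searchsorted(edges, m, side="left")
# One past the index of the last edge <= M.
stop = np.searchsorted(edges, M, side="right")

valid_edges = edges[start:stop]
# n edges delimit n - 1 bins; max() guards the case where no edge is kept.
valid_perc = perc[start:max(stop - 1, start)]
```

For these values the result keeps the edges 1, 5, 20 and 50 and the three percentages between them. Unlike the strict comparisons above, an edge exactly equal to m or M is kept here, so adjust the side argument if you want it excluded.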