
Filtering histogram edges and counts

Consider a histogram calculation of a numpy array that returns percentages:

import numpy as np

# 500 random numbers between 0 and 10,000
values = np.random.uniform(0, 10000, 500)

# Histogram using e.g. 200 buckets
perc, edges = np.histogram(values, bins=200,
                           weights=np.zeros_like(values) + 100/values.size)

The above returns two arrays:

  • perc containing the % (i.e. percentages) of values within each pair of consecutive edges[ix] and edges[ix+1] out of the total.
  • edges of length len(perc)+1

Now, say that I want to filter perc and edges so that I only end up with the percentages and edges for values contained within a new range [m, M].

That is, I want to work with the sub-arrays of perc and edges corresponding to the interval of values within [m, M]. The filtered percentages should still be expressed relative to the total number of values in the input array; nothing is re-normalized. We just want to filter perc and edges to end up with the correct sub-arrays.

How can I post-process perc and edges to do so?

The values of m and M can be any number of course. In the example above, we can assume e.g. m = 0 and M = 200.

Amelio Vazquez-Reina asked Oct 19 '22 15:10


2 Answers

m = 0; M = 200
# Boolean mask over the edges strictly inside (m, M); do not wrap it in a list
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([  37.4789683 ,   87.07491593,  136.67086357,  186.2668112 ])

Let's work on a smaller dataset so that it is easier to understand:

np.random.seed(0)
values = np.random.uniform(0, 100, 10)
values.sort()
>>> values
array([ 38.34415188,  42.36547993,  43.75872113,  54.4883183 ,
        54.88135039,  60.27633761,  64.58941131,  71.51893664,
        89.17730008,  96.36627605])

# Histogram using e.g. 10 buckets
perc, edges = np.histogram(values, bins=10,
                           weights=np.zeros_like(values) + 100./values.size)

>>> perc
array([ 30.,   0.,  20.,  10.,  10.,  10.,   0.,   0.,  10.,  10.])

>>> edges
array([ 38.34415188,  44.1463643 ,  49.94857672,  55.75078913,
        61.55300155,  67.35521397,  73.15742638,  78.9596388 ,
        84.76185122,  90.56406363,  96.36627605])

m = 0; M = 50
mask = (m <= edges) & (edges < M)
>>> mask
array([ True,  True,  True, False, False, False, False, False, False,
       False, False], dtype=bool)

>>> edges[mask]
array([ 38.34415188,  44.1463643 ,  49.94857672])

>>> perc[mask[:-1]][:-1]
array([ 30.,   0.])

m = 40; M = 60
mask = (m < edges) & (edges < M)
>>> edges[mask]
array([ 44.1463643 ,  49.94857672,  55.75078913])
>>> perc[mask[:-1]][:-1]
array([  0.,  20.])
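
One caveat: perc[mask[:-1]][:-1] drops one bin too many when the mask includes the last edge (for example when M is above edges[-1]). A minimal sketch of a variant that avoids this, assuming a bin should be kept only when both of its edges fall inside [m, M]; the helper name filter_histogram is hypothetical, not part of numpy:

```python
import numpy as np

def filter_histogram(perc, edges, m, M):
    """Keep the edges inside [m, M] and the bins bounded by two kept edges."""
    perc = np.asarray(perc)
    edges = np.asarray(edges)
    edge_mask = (m <= edges) & (edges <= M)
    # Bin i spans edges[i]..edges[i+1]; keep it only if both edges are kept.
    bin_mask = edge_mask[:-1] & edge_mask[1:]
    return perc[bin_mask], edges[edge_mask]

# Same small dataset as above
np.random.seed(0)
values = np.random.uniform(0, 100, 10)
perc, edges = np.histogram(values, bins=10,
                           weights=np.zeros_like(values) + 100.0 / values.size)

sub_perc, sub_edges = filter_histogram(perc, edges, 40, 60)
print(sub_edges)  # the three edges between 40 and 60
print(sub_perc)   # the two bins between them: 0% and 20%
```

With m = 80 and M = 100 this keeps the last three edges and both of the bins between them, whereas the [:-1] pattern would return only one bin.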
Alexander answered Oct 31 '22 09:10


Well you might need some mathematics for this. The bins are equally spaced so you can determine which bin is the first to include and which is the last by using the width of each bin:

bin_width = edges[1] - edges[0]

Now compute the first and last valid bin:

import math
first = math.floor((m - edges[0]) / bin_width) + 1 # How many bins from the left
last = math.floor((edges[-1] - M) / bin_width) + 1 # How many bins from the right

(Ignore the +1 for both if you want to include the bin containing m or M - but then be careful that you don't end up with negative values for first and last!)

Now you know how many bins to include:

valid_edges = edges[first:-last]
valid_perc = perc[first:-last]

This excludes the leading 'first' entries and the trailing 'last' entries of each array.

I may not have paid enough attention to rounding, so there could be an off-by-one error in there, but I think the idea is sound. :-)

You probably need to catch special cases like M > edges[-1] but for readability I haven't included these.
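
A rough sketch of how those special cases could be handled, assuming equally spaced bins. It is a variant, not the exact formula above: it uses positive start/stop indices instead of a negative stop (so a zero count does not produce an empty slice) and clamps them to the valid range. The helper name slice_equal_bins is made up for illustration, and an m or M that sits exactly on an edge may still be affected by floating-point rounding:

```python
import math
import numpy as np

def slice_equal_bins(perc, edges, m, M):
    """Arithmetic selection for equally spaced bins, with clamping."""
    bin_width = edges[1] - edges[0]
    # Index of the first edge >= m and of the last edge <= M.
    lo = math.ceil((m - edges[0]) / bin_width)
    hi = math.floor((M - edges[0]) / bin_width)
    # Clamp so that m < edges[0] or M > edges[-1] cannot yield bad indices.
    lo = max(lo, 0)
    hi = min(hi, len(edges) - 1)
    if hi < lo:
        # No edge lies inside [m, M]: return empty sub-arrays.
        return perc[:0], edges[:0]
    # Edges lo..hi are kept; the bins between them are lo..hi-1.
    return perc[lo:hi], edges[lo:hi + 1]

np.random.seed(0)
values = np.random.uniform(0, 100, 10)
perc, edges = np.histogram(values, bins=10,
                           weights=np.zeros_like(values) + 100.0 / values.size)

valid_perc, valid_edges = slice_equal_bins(perc, edges, 0, 200)
print(len(valid_perc), len(valid_edges))  # the whole histogram survives
```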


Or, if the bins are not equally spaced, use boolean masks instead of the calculation:

# No +1 here: each count is already the number of edges to drop on that side
first = edges[edges < m].size
last = edges[edges > M].size
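
For unequally spaced (but sorted) edges, the same selection can also be sketched with np.searchsorted, which returns non-negative positions and so avoids the empty slice that a negative stop of -0 would give. The helper name slice_sorted_edges and the toy arrays below are assumptions made for illustration:

```python
import numpy as np

def slice_sorted_edges(perc, edges, m, M):
    """Select edges in [m, M] and the bins between them via binary search."""
    perc = np.asarray(perc)
    edges = np.asarray(edges)
    lo = np.searchsorted(edges, m, side="left")   # first index with edges[i] >= m
    hi = np.searchsorted(edges, M, side="right")  # one past the last edges[i] <= M
    kept_edges = edges[lo:hi]
    # k kept edges bound k - 1 bins; max() keeps the slice empty when k == 0.
    kept_perc = perc[lo:max(hi - 1, lo)]
    return kept_perc, kept_edges

# Toy histogram with unequal bin widths (made-up numbers)
toy_edges = np.array([0.0, 1.0, 3.0, 7.0, 15.0])
toy_perc = np.array([10.0, 20.0, 30.0, 40.0])

print(slice_sorted_edges(toy_perc, toy_edges, 1, 7))
```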
MSeifert answered Oct 31 '22 09:10