
Efficiently get indices of histogram bins in Python



I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but it can be orders of magnitude faster.

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    N = len(values)
    r0, r1 = range

    # Map each x to its bin index; this assumes equal-width bins over [r0, r1).
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    # Building the CSR matrix groups `values` by bin: row i of S holds the
    # values whose x falls in bin i, stored contiguously in S.data.
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    # S.indptr marks where each row (bin) starts in S.data, so splitting on it
    # yields one array of values per bin, ready to pass to func.
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
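A minimal usage sketch with made-up data, checking the drop-in behaviour against scipy.stats.binned_statistic (the alias scipy_binned_statistic is only there to avoid the name clash):

import numpy as np
from scipy.stats import binned_statistic as scipy_binned_statistic

x = np.random.random(100_000)
values = np.random.random(100_000)

# Per-bin means from the sparse-matrix version above.
fast = binned_statistic(x, values, np.mean, nbins=10, range=(0.0, 1.0))

# Reference result from scipy; the two should agree for data strictly inside
# the range, up to floating-point effects right at the bin edges.
ref, _, _ = scipy_binned_statistic(x, values, statistic='mean',
                                   bins=10, range=(0.0, 1.0))
print(np.allclose(fast, ref))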


I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.

import numpy as np
import matplotlib.pyplot as plt

vals = np.random.random(10_000)
nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)   # bin index of each value, 1..nbins here

# Sort once; afterwards each bin is a contiguous slice of ordered_vals.
new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _, first_hit = np.unique(ind, return_index=True)
# faster way: the position where each bin starts in the sorted array
first_hit = np.searchsorted(ind, np.arange(1, nbins + 1))

# example of using the data:
for j in range(nbins - 1):
    # using a plotting function in place of your f, to show that the bins cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j + 1]], 'o')

The figure shows that the bins actually form clusters, as expected.
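If you also want the original indices of the elements in each bin (which is what the question's title asks for), new_order already holds them in bin order, so it can be sliced with the same boundaries; a short sketch, reusing new_order, first_hit and nbins from the code above:

# Original positions (into vals) of the members of each bin shown above.
indices_per_bin = [new_order[first_hit[j]:first_hit[j + 1]]
                   for j in range(nbins - 1)]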


You can halve the computation time by sorting the array first and then using np.searchsorted.

import numpy as np

vals = np.random.random(int(1e8))
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)   # sorted as well, because vals is sorted

# func is the per-bin statistic you want to apply, e.g. np.mean
results = [func(vals[np.searchsorted(ind, j, side='left'):
                     np.searchsorted(ind, j, side='right')])
           for j in range(1, nbins)]

Using 1e8 as my test case, I go from 34 seconds of computation to about 17.
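Since ind is sorted here as well, the two boundary lookups can be hoisted out of the list comprehension and done once for all bins; a minimal sketch of that variant, assuming the same vals, ind, nbins and func as above:

import numpy as np

# One vectorized search per side gives every bin's start and stop position.
starts = np.searchsorted(ind, np.arange(1, nbins), side='left')
stops = np.searchsorted(ind, np.arange(1, nbins), side='right')
results = [func(vals[i:j]) for i, j in zip(starts, stops)]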


One efficient solution is using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(ind).split(vals)
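group_by(ind).split(vals) returns one array of values per occupied bin. If you need the positions of each bin's members in the original array instead (as in the question's title), splitting an index array works the same way; a small sketch under that assumption, reusing vals and ind from the earlier answers:

import numpy as np
import numpy_indexed as npi

# Values grouped per bin, and the original indices grouped the same way.
values_per_bin = npi.group_by(ind).split(vals)
indices_per_bin = npi.group_by(ind).split(np.arange(len(vals)))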