Efficiently get indices of histogram bins in Python

People also ask

How many bins should a histogram have Python?

The default value of the number of bins to be created in a histogram is 10. However, we can change the size of bins using the parameter bins in matplotlib.

How do you find the values of a histogram in Python?

To get the values from a histogram, plt. hist returns them, so all you have to do is save them.

What is NP digitize?

Syntax : np.digitize(Array, Bin, Right) Return : Return an array of indices of the bins. Example #1 : In this example we can see that by using np. digitize() method, we are able to get the array of indices of the bin of each value which belongs to an array by using this method.

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic''' 

    N = len(values)
    r0, r1 = range

    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.

I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.

vals = np.random.random(1e4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _,first_hit = unique(ind,return_index=True)
# faster way:
first_hit = searchsorted(ind,arange(1,nbins-1))
first_hit.sort()

#example of using the data:
for j in range(nbins-1):
    #I am using a plotting function for your f, to show that they cluster
    plot(ordered_vals[first_hit[j]:first_hit[j+1]],'o')

The figure shows that the bins are actually clusters as expected: enter image description here

You can halve the computation time by sorting the array first, then use np.searchsorted.

vals = np.random.random(1e8)
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

results = [func(vals[np.searchsorted(ind,j,side='left'):
                     np.searchsorted(ind,j,side='right')])
           for j in range(1,nbins)]

Using 1e8 as my test case, I go from 34 seconds of computation to about 17.

One efficient solution is using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(ind).split(vals)

Related questions
                            
                                Describe your customized Vim editor for Python/Django development?
                            
                                How can I prevent sphinx from displaying the full path to my class?
                            
                                Detect if a python module changes and then reload
                            
                                SQLAlchemy and Multiple Databases
                            
                                Easy application logging/debugging with nginx, uwsgi, flask?
                            
                                how we can create android apk for python based application [duplicate]
                            
                                Java's TreeSet equivalent in Python?
                            
                                What is the correct way to override the __dir__ method?
                            
                                Non-blocking file access with Twisted
                            
                                Distribute a Python package with a compiled dynamic shared library
                            
                                Integration of Python console into a GUI C++ application
                            
                                why is converting a long 2D list to numpy array so slow?
                            
                                How does Spark running on YARN account for Python memory usage?
                            
                                Python: argparse optional arguments without dashes
                            
                                Python 2 and Python 3 dual development
                            
                                How to list dependencies for a python library without installing? [duplicate]
                            
                                Airflow Python Unit Test?
                            
                                celery shutdown worker after particular task
                            
                                Parsing HTML in Python [closed]
                            
                                How to check whether string might be type-cast to float in Python? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficiently get indices of histogram bins in Python

Tags:

python

numpy

scipy

People also ask

Recent Activity

Donate For Us