Is there a way to calculate many histograms along an axis of an nD-array? The method I currently have uses a for loop to iterate over all other axes and calculate a numpy.histogram() for each resulting 1D array:
import numpy
import itertools
data = numpy.random.rand(4, 5, 6)
# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]
# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]
out.shape  # (4, 5, 200001)
Needless to say this is very slow, however I couldn't find a way to solve this using numpy.histogram, numpy.histogram2d or numpy.histogramdd.
Here's a vectorized approach making use of the efficient tools np.searchsorted and np.bincount. searchsorted gives us the loactions where each element is to be placed based on the bins and bincount does the counting for us.
Implementation -
def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0],R[1],n_bins+1)
    data2D = data.reshape(-1,N)
    idx = np.searchsorted(bins, data2D,'right')-1
    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx==-1) | (idx==n_bins)
    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to 
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins*np.arange(data2D.shape[0])[:,None] + idx
    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins*data2D.shape[0]
    scaled_idx[bad_mask] = limit
    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(),minlength=limit+1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts
Runtime test
Original approach -
def org_app(data, n_bins, range_limits):
    R = range_limits
    m,n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]
    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out
Timings and verification -
In [2]: data = np.random.randn(4, 5, 6)
   ...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
   ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
   ...: print np.allclose(out1, out2)
   ...: 
True
In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop
In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
Since, in the loopy solution, we are looping through the first two axes. So, let's increase their lengths as that would show us how good is the vectorized one -
In [59]: data = np.random.randn(400, 500, 6)
In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5))
1 loops, best of 3: 9.59 s per loop
In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop
In [62]: 9590/44.2          # Speedup number
Out[62]: 216.9683257918552
The first solution provided a nice short idiom which uses numpy sortedsearch which has the cost of a sort and many searches.  But numpy has a fast route in its source code which is done in Python in fact, which can deal with equal bin edge range mathematically.  This solution uses only a vectorized subtraction and multiplication and some comparisons instead.
This solution will follow numpy code for the search sorted, type imputations, and handles weights as well as complex numbers. It is basically the first solution combined with numpy histogram fast route, and some extra type, and iteration details, etc.
_range = range
def hist_np_laxis(a, bins=10, range=None, weights=None):
    # Initialize empty histogram
    N = a.shape[-1]
    data2D = a.reshape(-1,N)
    limit = bins*data2D.shape[0]
    # gh-10322 means that type resolution rules are dependent on array
    # shapes. To avoid this causing problems, we pick a type now and stick
    # with it throughout.
    bin_type = np.result_type(range[0], range[1], a)
    if np.issubdtype(bin_type, np.integer):
        bin_type = np.result_type(bin_type, float)
    bin_edges = np.linspace(range[0],range[1],bins+1, endpoint=True, dtype=bin_type)
    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = np.dtype(np.intp)
    else:
        ntype = weights.dtype
    n = np.zeros(limit, ntype)
    # Pre-compute histogram scaling factor
    norm = bins / (range[1] - range[0])
    # We set a block size, as this allows us to iterate over chunks when
    # computing histograms, to minimize memory usage.
    BLOCK = 65536
    # We iterate over blocks here for two reasons: the first is that for
    # large arrays, it is actually faster (for example for a 10^8 array it
    # is 2x as fast) and it results in a memory footprint 3x lower in the
    # limit of large arrays.
    for i in _range(0, data2D.shape[0], BLOCK):
        tmp_a = data2D[i:i+BLOCK]
        block_size = tmp_a.shape[0]
        if weights is None:
            tmp_w = None
        else:
            tmp_w = weights[i:i + BLOCK]
        # Only include values in the right range
        keep = (tmp_a >= range[0])
        keep &= (tmp_a <= range[1])
        if not np.logical_and.reduce(np.logical_and.reduce(keep)):
            tmp_a = tmp_a[keep]
            if tmp_w is not None:
                tmp_w = tmp_w[keep]
        # This cast ensures no type promotions occur below, which gh-10322
        # make unpredictable. Getting it wrong leads to precision errors
        # like gh-8123.
        tmp_a = tmp_a.astype(bin_edges.dtype, copy=False)
        # Compute the bin indices, and for values that lie exactly on
        # last_edge we need to subtract one
        f_indices = (tmp_a - range[0]) * norm
        indices = f_indices.astype(np.intp)
        indices[indices == bins] -= 1
        # The index computation is not guaranteed to give exactly
        # consistent results within ~1 ULP of the bin edges.
        decrement = tmp_a < bin_edges[indices]
        indices[decrement] -= 1
        # The last bin includes the right edge. The other bins do not.
        increment = ((tmp_a >= bin_edges[indices + 1])
                     & (indices != bins - 1))
        indices[increment] += 1
        ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1)
        #indices = scaled_idx.reshape(-1)
        # We now compute the histogram using bincount
        if ntype.kind == 'c':
            n.real += np.bincount(indices, weights=tmp_w.real,
                                  minlength=limit)
            n.imag += np.bincount(indices, weights=tmp_w.imag,
                                  minlength=limit)
        else:
            n += np.bincount(indices, weights=tmp_w,
                             minlength=limit).astype(ntype)
    n.shape = a.shape[:-1] + (bins,)
    return n
data = np.random.randn(4, 5, 6)
out1 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
out2 = hist_np_laxis(data, bins=200001, range=(- 2.5, 2.5))
print(np.allclose(out1, out2))
True
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
92.1 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
55.1 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Although the first solution is faster in the small example and even the larger example:
data = np.random.randn(400, 500, 6)
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
264 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
71.6 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is not ALWAYS faster:
data = np.random.randn(400, 6, 500)
%timeit hist_np_laxis(data, bins=101, range=(- 2.5, 2.5))
71.5 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit hist_laxis(data, n_bins=101, range_limits=(- 2.5, 2.5))
76.9 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, the numpy variation is only faster when the last axis is large.  And its a very slight increase.  In all other cases I tried, the first solution is much faster regardless of bin count and size of the first 2 dimensions.  The only important line ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1) might be more optimizable though I have not found a faster method yet.
This would also imply the sheer number of vectorized operations of O(n) is outdoing the O(n log n) of the sort and repeated incremental searches.
However, realistic use cases will have a last axis with a lot of data and the prior axes with few. So in reality the samples in the first solution are too contrived to fit the desired performance.
Axis addition for histogram is noted as an issue in the numpy repo: https://github.com/numpy/numpy/issues/13166.
An xhistogram library has sought to solve this problem as well: https://xhistogram.readthedocs.io/en/latest/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With