Fastest way to get the minimum value of data array in another paired bin array

I have three 1D arrays:

  • idxs: the index data
  • weights: the weight of each index in idxs
  • bins: the bins within which the minimum weight is calculated

Here's my current method: use idxs to determine which bin each element of weights falls into, then calculate the min/max of the binned weights:

(illustration of the binning steps omitted)

  1. Get slices that show which bin each idxs element belongs to.
  2. Sort slices and weights at the same time.
  3. Calculate the minimum of weights in each bin (slice).

numpy method

import random
import numpy as np

# create example data
out_size = int(10)
bins = np.arange(3, out_size-3)
idxs = np.arange(0, out_size)
#random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs

# get which bin idxs belong to
slices = np.digitize(idxs, bins)

# get index and weights in bins
valid = (bins.max() >= idxs) & (idxs >= bins.min())
valid_slices = slices[valid]
valid_weights = weights[valid]

# sort slice and weights
sort_index = valid_slices.argsort()
valid_slices_sort = valid_slices[sort_index]
valid_weights_sort = valid_weights[sort_index]

# get the index of the first occurrence of each unique slice
unique_slices, unique_index = np.unique(valid_slices_sort, return_index=True)
# calculate the minimum
res_sub = np.minimum.reduceat(valid_weights_sort, unique_index)

# save results
res = np.full((out_size), np.nan)
res[unique_slices-1] = res_sub

print(res)

Results:

array([ 3., nan,  5., nan, nan, nan, nan, nan, nan, nan])

If I increase out_size to 1e7 and shuffle the data, the whole run (from np.digitize to the end) is slow:

13.5 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's the time consumed by each part:

np.digitize: 10.8 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
valid: 171 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
argsort and slice: 2.02 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique: 9.9 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.minimum.reduceat: 5.11 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.digitize() takes most of the time: 10.8 s. The next largest cost is argsort: 2.02 s.
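(For reference, bins in this toy example is a contiguous integer range with step 1. In that special case the binary search inside np.digitize can be replaced by plain arithmetic; a minimal sketch, assuming integer idxs and evenly spaced integer bins, which may not hold for my real data:)

import numpy as np

bins = np.arange(3, 7)            # e.g. [3, 4, 5, 6]
idxs = np.array([0, 3, 4, 6, 9])

# clip maps everything below bins[0] to 0 and everything at or above bins[-1]
# to len(bins), matching np.digitize's out-of-range behaviour
slices_fast = np.clip(idxs - bins[0] + 1, 0, len(bins))
assert np.array_equal(slices_fast, np.digitize(idxs, bins))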

I also checked the time consumed by calculating the mean using np.histogram:

counts, _ = np.histogram(idxs, bins=out_size, range=(0, out_size))
sums, _ = np.histogram(idxs, bins=out_size, range=(0, out_size), weights = weights, density=False)
mean = sums / np.where(counts == 0, np.nan, counts)

33.2 s ± 3.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is similar to my method of calculating the minimum.
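(As a possible alternative for the mean, np.bincount can produce the per-bin counts and weighted sums in one pass each; a minimal sketch, assuming idxs are non-negative integers smaller than out_size as in this example:)

counts = np.bincount(idxs, minlength=out_size)
sums = np.bincount(idxs, weights=weights, minlength=out_size)
mean = sums / np.where(counts == 0, np.nan, counts)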

scipy method

from scipy.stats import binned_statistic
statistics, _, _ = binned_statistic(idxs, weights, statistic='min', bins=bins)

print(statistics)

The result is slightly different, and it is much slower (about 6x) on the larger (1e7) shuffled data:

array([ 3., nan,  5.])

1min 20s ± 6.93 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

I want to figure out a quicker method. If the method is also suitable for dask, that would be excellent!

Use Case

Here's what my real data (1D) looks like (plot omitted):

asked Jun 08 '21 by zxdawn



2 Answers

SultanOrazbayev showed a quick approach; I'll add a cool one.

mask = bins[:, None] == idxs[None, :]
result = np.nanmin(np.where(mask, weights, np.nan), axis=-1)
# Note: may produce (expected) runtime warning if bin has no values

Of course, you can also use np.nanmax, np.nanmean, etc.
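For example (a sketch reusing mask and weights from above):

result_max = np.nanmax(np.where(mask, weights, np.nan), axis=-1)
result_mean = np.nanmean(np.where(mask, weights, np.nan), axis=-1)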

The above assumes that your bins are indeed single values. If they are ranges, you need slightly more work to construct the mask:

lower_mask = idxs[None, :] >= bins[:, None]
upper_mask = np.empty_like(lower_mask)
upper_mask[:-1, ...] = idxs[None, :] < bins[1:, None]
upper_mask[-1, ...] = False

mask = lower_mask & upper_mask

At that point you can use np.nanmin as above.


Of course, np.where and the broadcast used to create the mask will allocate new arrays of shape (len(bins), len(idxs)) with their respective datatypes. If this is of no concern to you, then the above is probably good enough.
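To put rough numbers on that (using the sizes from the update further down, about 3.8e4 bins and 1.5e5 indices): the boolean mask alone is 3.8e4 * 1.5e5 ≈ 5.7e9 elements, i.e. roughly 5.7 GB, and the float64 array produced by np.where is eight times that, around 46 GB.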

If it is a problem (because you are pressed for RAM), then my first suggestion is to buy more RAM. If, for some stupid reason (say, money), that doesn't work, you can avoid the copy of weights by using a masked array over a manually re-strided view into weights:

import numpy.ma as ma

mask = ...

restrided_weights = np.lib.stride_tricks.as_strided(
    weights, shape=(bins.size, idxs.size), strides=(0, idxs.strides[0])
)
masked = ma.masked_array(restrided_weights, mask=~mask, fill_value=np.nan, dtype=np.float64)
result = masked.min(axis=-1).filled(np.nan)

This avoids both a copy of weights and the above-mentioned runtime warning.

If you don't even have enough memory to construct mask, then you can try processing the data in chunks.

Last I checked, Dask used to have funny behavior when fed with manually strided arrays. There was some work on this though, so you may want to double-check if that is resolved, in which case you can happily parallelize the above.


Update based on your further comments to this answer and the other:

You can do this computation in chunks to avoid memory issues due to your large number of bins (on the order of 1e4). Putting the concrete numbers into a full example and adding a progress bar indicates <40 seconds runtime on a single core.

import numpy.ma as ma
from tqdm import trange
import numpy as np
import random

# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs

chunk_size = 100

# preallocate buffers to avoid array creation in main loop
extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity
mask_buffer = np.empty((chunk_size, len(idxs)), dtype=bool)


result = np.empty_like(bins, dtype=np.float64)

for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, ...] = ~((bins[low:high, None] <= idxs[None, :]) & (extended_bins[low+1:high+1, None] > idxs[None, :]))
    mask = mask_buffer[:chunk_size, ...]
    restrided_weights = np.lib.stride_tricks.as_strided(weights, shape=mask.shape, strides=(0, idxs.strides[0]))
    masked = ma.masked_array(restrided_weights, mask=mask, fill_value=np.nan, dtype=np.float64)
    result[low:high] = masked.min(axis=-1).filled(np.nan)

Bonus: For min and max only, there is a cool trick you can use: sort idxs and weights based on weights (ascending for min, descending for max). This way, you can immediately look up the min/max value and avoid the masked array and the custom strides altogether. This relies on some not-so-well-documented behavior of np.argmax, which takes a fast path for boolean arrays and doesn't search the full array.
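To illustrate why that works (a toy sketch, not part of the original code): once the weights are sorted ascending, the first True in a bin's boolean membership row already marks that bin's minimum, and np.argmax returns exactly that first True.

row = np.array([False, True, False, True])   # membership of the sorted weights in one bin
w_sorted = np.array([1.0, 2.0, 5.0, 7.0])    # weights sorted ascending
print(w_sorted[np.argmax(row)])              # 2.0 -> the bin's minimum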

It only works for these two cases, and you'd have to fall back to the above for more sophisticated things (mean), but for those two it shaves off another ~70% and a run on a single core clocks in at <12 seconds.

# fast min/max
from tqdm import trange
import numpy as np
import random

# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs


order = np.argsort(weights)
weights_sorted = np.empty((len(weights) + 1), dtype=np.float64)
weights_sorted[:-1] = weights[order]
weights_sorted[-1] = np.nan

idxs_sorted = idxs[order]

extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity

# preallocate buffers to avoid array creation in main loop
chunk_size = 1000
mask_buffer = np.empty((chunk_size, len(idxs) + 1), dtype=bool)
mask_buffer[:, -1] = True
result = np.empty_like(bins, dtype=np.float64)

for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, :-1] = (bins[low:high, None] <= idxs_sorted[None, :]) & (extended_bins[low+1:high+1, None] > idxs_sorted[None, :])
    mask = mask_buffer[:chunk_size, ...]
    weight_idx = np.argmax(mask, axis=-1)

    result[low:high] = weights_sorted[weight_idx]

answered Dec 26 '22 by FirefoxMetzger


A quick approach to achieve this is with dask.dataframe and pd.cut. I first show the pandas version:

import numpy as np
from scipy.stats import binned_statistic as bs
import pandas as pd

nrows=10**7

df = pd.DataFrame(np.random.rand(nrows, 2), columns=['x', 'val'])

bins = np.linspace(df['x'].min(), df['x'].max(), 10)

df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)

result_pandas = df.groupby('binned_x')['val'].min().values
result_scipy = bs(df['x'], df['val'], 'min', bins=bins)[0]

print(np.isclose(result_pandas, result_scipy))
# [ True  True  True  True  True  True  True  True  True]

Now to go from pandas to dask, you will need to make sure that bins are consistent across partitions, so take a look here. Once every partition is binned consistently, you want to apply the desired operation (min/max/sum/count):

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)

def f(df, bins):
    df = df.copy()
    df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)
    result = df.groupby('binned_x', as_index=False)['val'].min()
    return result

result_dask = ddf.map_partitions(f, bins).groupby('binned_x')['val'].min().compute()

print(np.isclose(result_pandas, result_dask))
# [ True  True  True  True  True  True  True  True  True]

On my laptop, the first code takes about 3 seconds (I originally said 7, but forgot that I was double-counting both pandas and scipy performing the same operation); the second code is about 10 times faster. There is scope for playing around with partitioning, but that's context-dependent, so it's something you can try optimizing on your data/hardware.

Update: note that this approach will work for min/max, but for the mean you will want to calculate the sum and count and then divide them. There is probably a good way of keeping track of weights to do this calculation in one go, but it might not be worth the added code complexity.
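For instance, a minimal sketch of that sum-and-count idea at the pandas level (reusing df and bins from above; the same pattern maps onto the dask version via map_partitions):

# per-bin mean as sum / count, reusing the binned column created earlier
grouped = df.groupby('binned_x')['val'].agg(['sum', 'count'])
mean_per_bin = (grouped['sum'] / grouped['count']).values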

answered Dec 26 '22 by SultanOrazbayev