Fastest way to get the minimum value of data array in another paired bin array

I have three 1D arrays:

  • idxs: the index data
  • weights: the weight of each index in idxs
  • bins: the bins within which the minimum weight is calculated

Here's my current method: use idxs to determine which bin each element of weights falls into, then calculate the min/max of the binned weights:

(illustration of the binning steps omitted)

  1. Get slices that show which bin each idxs element belongs to.
  2. Sort slices and weights at the same time.
  3. Calculate the minimum of weights in each bin (slice).

numpy method

import random
import numpy as np

# create example data
out_size = int(10)
bins = np.arange(3, out_size-3)
idxs = np.arange(0, out_size)
#random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs

# get which bin idxs belong to
slices = np.digitize(idxs, bins)

# get index and weights in bins
valid = (bins.max() >= idxs) & (idxs >= bins.min())
valid_slices = slices[valid]
valid_weights = weights[valid]

# sort slice and weights
sort_index = valid_slices.argsort()
valid_slices_sort = valid_slices[sort_index]
valid_weights_sort = valid_weights[sort_index]

# get the index of the first occurrence of each unique slice
unique_slices, unique_index = np.unique(valid_slices_sort, return_index=True)
# calculate the minimum
res_sub = np.minimum.reduceat(valid_weights_sort, unique_index)

# save results
res = np.full((out_size), np.nan)
res[unique_slices-1] = res_sub

print(res)

Results:

array([ 3., nan,  5., nan, nan, nan, nan, nan, nan, nan])

If I increase out_size to 1e7 and shuffle the data, the whole run (from np.digitize to the end) is slow:

13.5 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here's the time consumed by each part:

np.digitize: 10.8 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
valid: 171 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
argsort and slice: 2.02 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique: 9.9 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.minimum.reduceat: 5.11 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.digitize() takes most of the time: 10.8 s. The next largest cost is argsort: 2.02 s.
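(For reference, bins in this toy example is a contiguous integer range with step 1. In that special case the binary search inside np.digitize can be replaced by plain arithmetic; a minimal sketch, assuming integer idxs and evenly spaced integer bins, which may not hold for my real data:)

import numpy as np

bins = np.arange(3, 7)            # e.g. [3, 4, 5, 6]
idxs = np.array([0, 3, 4, 6, 9])

# clip maps everything below bins[0] to 0 and everything at or above bins[-1]
# to len(bins), matching np.digitize's out-of-range behaviour
slices_fast = np.clip(idxs - bins[0] + 1, 0, len(bins))
assert np.array_equal(slices_fast, np.digitize(idxs, bins))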

I also checked the time consumed by calculating the mean using np.histogram:

counts, _ = np.histogram(idxs, bins=out_size, range=(0, out_size))
sums, _ = np.histogram(idxs, bins=out_size, range=(0, out_size), weights = weights, density=False)
mean = sums / np.where(counts == 0, np.nan, counts)

33.2 s ± 3.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is similar to my method of calculating the minimum.
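(As a possible alternative for the mean, np.bincount can produce the per-bin counts and weighted sums in one pass each; a minimal sketch, assuming idxs are non-negative integers smaller than out_size as in this example:)

counts = np.bincount(idxs, minlength=out_size)
sums = np.bincount(idxs, weights=weights, minlength=out_size)
mean = sums / np.where(counts == 0, np.nan, counts)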

scipy method

from scipy.stats import binned_statistic
statistics, _, _ = binned_statistic(idxs, weights, statistic='min', bins=bins)

print(statistics)

The result is slightly different, and it is much slower (about 6x) on the larger (1e7) shuffled data:

array([ 3., nan,  5.])

1min 20s ± 6.93 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

I want to figure out a quicker method. If the method is also suitable for dask, that would be excellent!

Use Case

Here's what my real data (1D) looks like (plot omitted):

asked Jun 08 '21 by zxdawn



2 Answers

SultanOrazbayev showed a quick approach; I'll add a cool one.

mask = bins[:, None] == idxs[None, :]
result = np.nanmin(np.where(mask, weights, np.nan), axis=-1)
# Note: may produce (expected) runtime warning if bin has no values

Of course, you can also use np.nanmax, np.nanmean, etc.
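For example (a sketch reusing mask and weights from above):

result_max = np.nanmax(np.where(mask, weights, np.nan), axis=-1)
result_mean = np.nanmean(np.where(mask, weights, np.nan), axis=-1)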

The above assumes that your bins are indeed single values. If they are ranges, you need slightly more work to construct the mask:

lower_mask = idxs[None, :] >= bins[:, None]
upper_mask = np.empty_like(lower_mask)
upper_mask[:-1, ...] = idxs[None, :] < bins[1:, None]
upper_mask[-1, ...] = False

mask = lower_mask & upper_mask

At that point you can use np.nanmin as above.


Of course, np.where and the broadcast used to create the mask will allocate new arrays of shape (len(bins), len(idxs)) with their respective datatypes. If this is of no concern to you, then the above is probably good enough.
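To put rough numbers on that (using the sizes from the update further down, about 3.8e4 bins and 1.5e5 indices): the boolean mask alone is 3.8e4 * 1.5e5 ≈ 5.7e9 elements, i.e. roughly 5.7 GB, and the float64 array produced by np.where is eight times that, around 46 GB.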

If it is a problem (because you are pressed for RAM), then my first suggestion is to buy more RAM. If, for some stupid reason (say, money), that doesn't work, you can avoid the copy of weights by using a masked array over a manually re-strided view into weights:

import numpy.ma as ma

mask = ...

restrided_weights = np.lib.stride_tricks.as_strided(
    weights, shape=(bins.size, idxs.size), strides=(0, idxs.strides[0])
)
masked = ma.masked_array(restrided_weights, mask=~mask, fill_value=np.nan, dtype=np.float64)
result = masked.min(axis=-1).filled(np.nan)

This avoids both a copy of weights and the above-mentioned runtime warning.

If you don't even have enough memory to construct mask, then you can try processing the data in chunks.

Last I checked, Dask used to have funny behavior when fed with manually strided arrays. There was some work on this though, so you may want to double-check if that is resolved, in which case you can happily parallelize the above.


Update based on your further comments to this answer and the other:

You can do this computation in chunks to avoid memory issues due to your large number of bins (on the order of 1e4). Putting the concrete numbers into a full example and adding a progress bar indicates <40 seconds runtime on a single core.

import numpy.ma as ma
from tqdm import trange
import numpy as np
import random

# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs

chunk_size = 100

# preallocate buffers to avoid array creation in main loop
extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity
mask_buffer = np.empty((chunk_size, len(idxs)), dtype=bool)


result = np.empty_like(bins, dtype=np.float64)

for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, ...] = ~((bins[low:high, None] <= idxs[None, :]) & (extended_bins[low+1:high+1, None] > idxs[None, :]))
    mask = mask_buffer[:chunk_size, ...]
    restrided_weights = np.lib.stride_tricks.as_strided(weights, shape=mask.shape, strides=(0, idxs.strides[0]))
    masked = ma.masked_array(restrided_weights, mask=mask, fill_value=np.nan, dtype=np.float64)
    result[low:high] = masked.min(axis=-1).filled(np.nan)

Bonus: For min and max only, there is a cool trick you can use: sort idxs and weights based on weights (ascending for min, descending for max). This way, you can immediately look up the min/max value and avoid the masked array and the custom strides altogether. This relies on some not-so-well-documented behavior of np.argmax, which takes a fast path for boolean arrays and doesn't search the full array.
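To illustrate why that works (a toy sketch, not part of the original code): once the weights are sorted ascending, the first True in a bin's boolean membership row already marks that bin's minimum, and np.argmax returns exactly that first True.

row = np.array([False, True, False, True])   # membership of the sorted weights in one bin
w_sorted = np.array([1.0, 2.0, 5.0, 7.0])    # weights sorted ascending
print(w_sorted[np.argmax(row)])              # 2.0 -> the bin's minimum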

It only works for these two cases, and you'd have to fall back to the above for more sophisticated things (mean), but for those two it shaves off another ~70% and a run on a single core clocks in at <12 seconds.

# fast min/max
from tqdm import trange
import numpy as np
import random

# create example data
out_size = int(1.5e5)
#bins = np.arange(3, out_size-3)
bins = np.arange(3, int(3.8e4-3), dtype=np.int64)
idxs = np.arange(0, out_size)
random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs


order = np.argsort(weights)
weights_sorted = np.empty((len(weights) + 1), dtype=np.float64)
weights_sorted[:-1] = weights[order]
weights_sorted[-1] = np.nan

idxs_sorted = idxs[order]

extended_bins = np.empty(len(bins) + 1, dtype=bins.dtype)
extended_bins[:-1] = bins
extended_bins[-1] = np.iinfo(bins.dtype).max # last bin goes to infinity

# preallocate buffers to avoid array creation in main loop
chunk_size = 1000
mask_buffer = np.empty((chunk_size, len(idxs) + 1), dtype=bool)
mask_buffer[:, -1] = True
result = np.empty_like(bins, dtype=np.float64)

for low in trange(0, len(bins), chunk_size):
    high = min(low + chunk_size, len(bins))
    chunk_size = high - low
    mask_buffer[:chunk_size, :-1] = (bins[low:high, None] <= idxs_sorted[None, :]) & (extended_bins[low+1:high+1, None] > idxs_sorted[None, :])
    mask = mask_buffer[:chunk_size, ...]
    weight_idx = np.argmax(mask, axis=-1)

    result[low:high] = weights_sorted[weight_idx]

answered Dec 26 '22 by FirefoxMetzger


A quick approach to achieve this is with dask.dataframe and pd.cut. I first show the pandas version:

import numpy as np
from scipy.stats import binned_statistic as bs
import pandas as pd

nrows=10**7

df = pd.DataFrame(np.random.rand(nrows, 2), columns=['x', 'val'])

bins = np.linspace(df['x'].min(), df['x'].max(), 10)

df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)

result_pandas = df.groupby('binned_x')['val'].min().values
result_scipy = bs(df['x'], df['val'], 'min', bins=bins)[0]

print(np.isclose(result_pandas, result_scipy))
# [ True  True  True  True  True  True  True  True  True]

Now to go from pandas to dask, you will need to make sure that bins are consistent across partitions, so take a look here. Once every partition is binned consistently, you want to apply the desired operation (min/max/sum/count):

import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=10)

def f(df, bins):
    df = df.copy()
    df['binned_x'] = pd.cut(df['x'], bins=bins, right=False)
    result = df.groupby('binned_x', as_index=False)['val'].min()
    return result

result_dask = ddf.map_partitions(f, bins).groupby('binned_x')['val'].min().compute()

print(np.isclose(result_pandas, result_dask))
# [ True  True  True  True  True  True  True  True  True]

On my laptop, the first code takes about 3 seconds (I originally said 7, but forgot that I was double-counting both pandas and scipy performing the same operation); the second code is about 10 times faster. There is scope for playing around with partitioning, but that's context-dependent, so it's something you can try optimizing on your data/hardware.

Update: note that this approach will work for min/max, but for the mean you will want to calculate the sum and count and then divide them. There is probably a good way of keeping track of weights to do this calculation in one go, but it might not be worth the added code complexity.
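For instance, a minimal sketch of that sum-and-count idea at the pandas level (reusing df and bins from above; the same pattern maps onto the dask version via map_partitions):

# per-bin mean as sum / count, reusing the binned column created earlier
grouped = df.groupby('binned_x')['val'].agg(['sum', 'count'])
mean_per_bin = (grouped['sum'] / grouped['count']).values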

answered Dec 26 '22 by SultanOrazbayev