
How to speed up array masking from the results of numpy.searchsorted in Python?

I want to generate a mask from the results of numpy.searchsorted():

import numpy as np

# generate test examples
x = np.random.rand(1000000)
y = np.random.rand(200)

# sort x
idx = np.argsort(x)
sorted_x = np.take_along_axis(x, idx, axis=-1)

# searchsort y in x
pt = np.searchsorted(sorted_x, y)

pt is an array of insertion indices. I then want to create a boolean mask of shape (200, 1000000) whose row i is True at the indices idx[0:pt[i]], and I came up with a for-loop like this:

mask = np.zeros((200, 1000000), dtype='bool')
for i in range(200):
    mask[i, idx[0:pt[i]]] = True

Does anyone have an idea to speed up the for-loop?

asked Oct 23 '20 by f. c.



3 Answers

Approach #1

Going by the new information from OP's comments that only y changes in real time, we can pre-process a lot of the work around x and hence do much better. We will create a hashing array that stores stepped masks. For the part that involves y, we simply index into the hashing array with the indices obtained from searchsorted, which approximates the final mask array. A final step of assigning the remaining bools can be offloaded to numba given its ragged nature. This should also be beneficial if we decide to scale up the length of y.

Let's look at the implementation.

Pre-processing with x:

sidx = x.argsort()
ssidx = sidx.argsort()  # rank of each element of x (equivalent to x.argsort().argsort())

# Choose a scale factor.
# 1. A small one stores more mapping info, hence faster, but occupies more memory.
# 2. A big one stores less mapping info, hence slower, but is memory efficient.
scale_factor = 100
mapar = np.arange(0,len(x),scale_factor)[:,None] > ssidx
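
To make the stepped masks concrete, here is a toy example of my own (not part of the original answer; the *_small names are hypothetical):

# Toy example: length-10 x with scale_factor=3; row k of the hashing
# array is True exactly at the positions of the k*3 smallest values.
x_small = np.array([0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0])
ssidx_small = x_small.argsort().argsort()   # rank of each element
mapar_small = np.arange(0, len(x_small), 3)[:, None] > ssidx_small
print(mapar_small[2])   # True at the positions of the 6 smallest values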

Remaining steps with y:

import numba as nb

@nb.njit(parallel=True,fastmath=True)
def array_masking3(out, starts, idx, sidx):
    # Assign the leftover bools: for row i, set True at the positions
    # whose sorted rank lies in [starts[i], idx[i])
    N = len(out)
    for i in nb.prange(N):
        for j in nb.prange(starts[i], idx[i]):
            out[i,sidx[j]] = True
    return out

idx = np.searchsorted(x,y,sorter=sidx)
s0 = idx//scale_factor
starts = s0*scale_factor
out = mapar[s0]  # approximate mask from the pre-computed stepped masks
out = array_masking3(out, starts, idx, sidx)  # fill in the ragged remainder
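
As a sanity check (my addition, not part of the original answer), the result can be compared against the question's loop-based construction:

# Rebuild the mask with the question's loop and compare
ref = np.zeros((len(y), len(x)), dtype='bool')
for i in range(len(y)):
    ref[i, sidx[:idx[i]]] = True
assert np.array_equal(out, ref)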

Benchmarking

In [2]: x = np.random.rand(1000000)
   ...: y = np.random.rand(200)

In [3]: ## Pre-processing step with "x"
   ...: sidx = x.argsort()
   ...: ssidx = x.argsort().argsort()
   ...: scale_factor = 100
   ...: mapar = np.arange(0,len(x),scale_factor)[:,None] > ssidx

In [4]: %%timeit
   ...: idx = np.searchsorted(x,y,sorter=sidx)
   ...: s0 = idx//scale_factor
   ...: starts = s0*scale_factor
   ...: out = mapar[s0]
   ...: out = array_masking3(out, starts, idx, sidx)
41 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

 # A hashing array 1/10th the size has similar timings
In [7]: scale_factor = 1000
   ...: mapar = np.arange(0,len(x),scale_factor)[:,None] > ssidx

In [8]: %%timeit
   ...: idx = np.searchsorted(x,y,sorter=sidx)
   ...: s0 = idx//scale_factor
   ...: starts = s0*scale_factor
   ...: out = mapar[s0]
   ...: out = array_masking3(out, starts, idx, sidx)
40.6 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @silgon's soln    
In [5]: %timeit x[np.newaxis,:] < y[:,np.newaxis]
138 ms ± 896 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Approach #2

This borrows a good part from OP's solution.

import numba as nb

@nb.njit(parallel=True)
def array_masking2(mask1D, mask_out, idx, pt):
    n = len(idx)
    for j in nb.prange(len(pt)):
        if mask1D[j]:
            # row was pre-filled with True: clear the complement instead
            for i in nb.prange(pt[j],n):
                mask_out[j, idx[i]] = False
        else:
            for i in nb.prange(pt[j]):
                mask_out[j, idx[i]] = True
    return mask_out

def app2(idx, pt):
    m,n = len(pt), len(idx)
    mask1 = pt>n//2  # rows where more than half the entries will be True
    mask2 = np.broadcast_to(mask1[:,None], (m,n)).copy()
    return array_masking2(mask1, mask2, idx, pt)

So, the idea is: once more than half of a row's indices are to be set True, we pre-assign that row as all True and set the complement to False instead. This results in fewer memory accesses and hence a noticeable performance boost.
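
For reference, a minimal usage sketch (assuming the question's setup, where idx sorts x and pt comes from searchsorted):

idx = np.argsort(x)
pt = np.searchsorted(x[idx], y)
mask = app2(idx, pt)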

Benchmarking

OP's solution:

@nb.njit(parallel=True,fastmath=True)
def array_masking(mask, idx, pt):
    for j in nb.prange(pt.shape[0]):
        for i in nb.prange(pt[j]):
            mask[j, idx[i]] = True
    return mask

def app1(idx, pt):
    m,n = len(pt), len(idx)      
    mask = np.zeros((m, n), dtype='bool')
    return array_masking(mask, idx, pt)

Timings -

In [5]: np.random.seed(0)
   ...: x = np.random.rand(1000000)
   ...: y = np.random.rand(200)
   ...: idx = np.argsort(x)
   ...: pt = np.searchsorted(x[idx], y)

In [6]: %timeit app1(idx, pt)
264 ms ± 8.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit app2(idx, pt)
165 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
answered Oct 18 '22 by Divakar


This is an alternative answer, though I'm not sure it's exactly what you need.

x = np.random.rand(1000000)
y = np.random.rand(200)
mask = x[np.newaxis,:] < y[:,np.newaxis]  # broadcasted compare: mask[i, j] = (x[j] < y[i])

Note: I said it may not be what you need because you specify the use of numpy.searchsorted() and I don't use it here, yet I get the same result. It might also be useful to somebody else in the future if it doesn't completely fit your needs ;).
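
A quick equivalence check (my sketch, reusing the question's setup; it assumes x has no duplicate values, which holds with probability 1 for random floats):

idx = np.argsort(x)
pt = np.searchsorted(x[idx], y)
ref = np.zeros((len(y), len(x)), dtype='bool')
for i in range(len(y)):
    ref[i, idx[:pt[i]]] = True
assert np.array_equal(mask, ref)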

Timings (@DanielF edit)

Setup:

import numpy as np

# generate test examples
x = np.random.rand(1000000)
y = np.random.rand(200)

# sort x
idx = np.argsort(x)
sorted_x = np.take_along_axis(x, idx, axis=-1)

Running:

%%timeit   #  silgon
mask = x[np.newaxis,:] < y[:,np.newaxis]

166 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%%timeit    # Divakar
pt = np.searchsorted(sorted_x, y)
mask = app2(idx, pt)

316 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%%timeit   #  f.c.
pt = np.searchsorted(sorted_x, y)
mask = np.zeros((200, 1000000), dtype='bool')
for i in range(200):
    mask[i, idx[0:pt[i]]] = True
     
466 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
answered Oct 17 '22 by silgon


I have implemented a numba version of the for-loop, and it is slightly faster than the pure Python version. Here is the code:

import numba as nb

@nb.njit(parallel=True,fastmath=True)
def array_masking(mask, idx, pt):
    # fills `mask` in place, parallelized over the rows
    for j in nb.prange(pt.shape[0]):
        for i in nb.prange(pt[j]):
            mask[j, idx[i]] = True
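
A minimal usage sketch (reusing idx and pt from the question's setup); note the function fills the mask in place rather than returning it:

mask = np.zeros((len(y), len(x)), dtype='bool')
array_masking(mask, idx, pt)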

I am looking for further speedup. Any ideas?

answered Oct 17 '22 by f. c.