Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently return the index of the first value satisfying condition in array

I need to find the index of the first value in a 1d NumPy array, or Pandas numeric series, satisfying a condition. The array is large and the index may be near the start or end of the array, or the condition may not be met at all. I can't tell in advance which is more likely. If the condition is not met, the return value should be -1. I've considered a few approaches.

Attempt 1

# func(arr) returns a Boolean array
idx = next(iter(np.where(func(arr))[0]), -1)

But this is often too slow as func(arr) applies a vectorised function on the entire array rather than stopping when the condition is met. Specifically, it is expensive when the condition is met near the start of the array.

Attempt 2

np.argmax is marginally faster, but fails to identify when a condition is never met:

np.random.seed(0)
arr = np.random.rand(10**7)

assert next(iter(np.where(arr > 0.999999)[0]), -1) == np.argmax(arr > 0.999999)

%timeit next(iter(np.where(arr > 0.999999)[0]), -1)  # 21.2 ms
%timeit np.argmax(arr > 0.999999)                    # 17.7 ms

np.argmax(arr > 1.0) returns 0, i.e. an instance when the condition is not satisfied.

Attempt 3

# func(arr) returns a Boolean scalar
idx = next((idx for idx, val in enumerate(arr) if func(arr)), -1)

But this is too slow when the condition is met near the end of the array. Presumably this is because the generator expression has an expensive overhead from a large number of __next__ calls.

Is this always a compromise or is there a way, for generic func, to extract the first index efficiently?

Benchmarking

For benchmarking, assume func finds the index when a value is greater than a given constant:

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
import numpy as np

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

# Start of array benchmark
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 µs

# End of array benchmark
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms
like image 629
jpp Avatar asked Oct 27 '18 10:10

jpp


1 Answers

numba

With numba it's possible to optimise both scenarios. Syntactically, you need only construct a function with a simple for loop:

from numba import njit

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

idx = get_first_index_nb(A, 0.9)

Numba improves performance by JIT ("Just In Time") compiling code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you've already tried for the case where the condition is met late.

For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function:

idx = get_first_index_nb(df['data'].values, 0.9)

Generalisation

Since numba permits functions as arguments, and assuming the passed the function can also be JIT-compiled, you can arrive at a method to calculate the nth index where a condition is met for an arbitrary func.

@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)

For the 3rd last value, you can feed the reverse, arr[::-1], and negate the result from len(arr) - 1, the - 1 necessary to account for 0-indexing.

Performance benchmarking

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

def get_first_index_np(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

%timeit get_first_index_nb(arr, m)                                 # 375 ns
%timeit get_first_index_np(arr, m)                                 # 2.71 µs
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 µs

%timeit get_first_index_nb(arr, n)                                 # 204 µs
%timeit get_first_index_np(arr, n)                                 # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms
like image 52
jpp Avatar answered Oct 18 '22 22:10

jpp