Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching a sequence in a NumPy array

Let's say I have the following array :

 array([2, 0, 0, 1, 0, 1, 0, 0])

How do I get the indices where I have occurrence of sequence of values : [0,0]? So, the expected output for such a case would be : [1,2,6,7].

Edit :

1) Please note that [0,0] is just a sequence. It could be [0,0,0] or [4,6,8,9] or [5,2,0], just anything.

2) If my array were modified to : array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0]), the expected result with the same sequence of [0,0] would be [1,2,3,4,8,9].

I am looking for some NumPy shortcut.

like image 739
sten Avatar asked Apr 09 '16 20:04

sten


2 Answers

Well, this is basically a template-matching problem that comes up in image-processing a lot. Listed in this post are two approaches: Pure NumPy based and OpenCV (cv2) based.

Approach #1: With NumPy, one can create a 2D array of sliding indices across the entire length of the input array. Thus, each row would be a sliding window of elements. Next, match up each row with the input sequence, which will bring in broadcasting for a vectorized solution. We look for all True rows indicating those are the ones that are the perfect matches and as such would be the starting indices of the matches. Finally, using those indices, create a range of indices extending up to the length of the sequence, to give us the desired output. The implementation would be -

def search_sequence_numpy(arr,seq):
    """ Find sequence in an array using NumPy only.

    Parameters
    ----------    
    arr    : input 1D array
    seq    : input 1D array

    Output
    ------    
    Output : 1D Array of indices in the input array that satisfy the 
    matching of input sequence in the input array.
    In case of no match, an empty list is returned.
    """

    # Store sizes of input array and sequence
    Na, Nseq = arr.size, seq.size

    # Range of sequence
    r_seq = np.arange(Nseq)

    # Create a 2D array of sliding indices across the entire length of input array.
    # Match up with the input sequence & get the matching starting indices.
    M = (arr[np.arange(Na-Nseq+1)[:,None] + r_seq] == seq).all(1)

    # Get the range of those indices as final output
    if M.any() >0:
        return np.where(np.convolve(M,np.ones((Nseq),dtype=int))>0)[0]
    else:
        return []         # No match found

Approach #2: With OpenCV (cv2), we have a built-in function for template-matching : cv2.matchTemplate. Using this, we would have the starting matching indices. Rest of the steps would be same as for the previous approach. Here's the implementation with cv2 :

from cv2 import matchTemplate as cv2m

def search_sequence_cv2(arr,seq):
    """ Find sequence in an array using cv2.
    """

    # Run a template match with input sequence as the template across
    # the entire length of the input array and get scores.
    S = cv2m(arr.astype('uint8'),seq.astype('uint8'),cv2.TM_SQDIFF)

    # Now, with floating point array cases, the matching scores might not be 
    # exactly zeros, but would be very small numbers as compared to others.
    # So, for that use a very small to be used to threshold the scorees 
    # against and decide for matches.
    thresh = 1e-5 # Would depend on elements in seq. So, be careful setting this.

    # Find the matching indices
    idx = np.where(S.ravel() < thresh)[0]

    # Get the range of those indices as final output
    if len(idx)>0:
        return np.unique((idx[:,None] + np.arange(seq.size)).ravel())
    else:
        return []         # No match found

Sample run

In [512]: arr = np.array([2, 0, 0, 0, 0, 1, 0, 1, 0, 0])

In [513]: seq = np.array([0,0])

In [514]: search_sequence_numpy(arr,seq)
Out[514]: array([1, 2, 3, 4, 8, 9])

In [515]: search_sequence_cv2(arr,seq)
Out[515]: array([1, 2, 3, 4, 8, 9])

Runtime test

In [477]: arr = np.random.randint(0,9,(100000))
     ...: seq = np.array([3,6,8,4])
     ...: 

In [478]: np.allclose(search_sequence_numpy(arr,seq),search_sequence_cv2(arr,seq))
Out[478]: True

In [479]: %timeit search_sequence_numpy(arr,seq)
100 loops, best of 3: 11.8 ms per loop

In [480]: %timeit search_sequence_cv2(arr,seq)
10 loops, best of 3: 20.6 ms per loop

Seems like the Pure NumPy based one is the safest and fastest!

like image 127
Divakar Avatar answered Nov 07 '22 18:11

Divakar


I find that the most succinct, intuitive and general way to do this is using regular expressions.

import re
import numpy as np

# Set the threshold for string printing to infinite
np.set_printoptions(threshold=np.inf)

# Remove spaces and linebreaks that would come through when printing your vector
yourarray_string = re.sub('\n|\s','',np.array_str( yourarray ))[1:-1]

# The next line is the most important, set the arguments in the braces
# such that the first argument is the shortest sequence you want
# and the second argument is the longest (using empty as infinite length)

r = re.compile(r"[0]{1,}") 
zero_starts = [m.start() for m in r.finditer( yourarray_string )]
zero_ends = [m.end() for m in r.finditer( yourarray_string )]
like image 1
Jozsef Meszaros Avatar answered Nov 07 '22 19:11

Jozsef Meszaros