Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python/NumPy first occurrence of subarray

In Python or NumPy, what is the best way to find out the first occurrence of a subarray?

For example, I have

a = [1, 2, 3, 4, 5, 6]
b = [2, 3, 4]

What is the fastest way (run-time-wise) to find out where b occurs in a? I understand for strings this is extremely easy, but what about for a list or numpy ndarray?

Thanks a lot!

[EDITED] I prefer the numpy solution, since from my experience numpy vectorization is much faster than Python list comprehension. Meanwhile, the big array is huge, so I don't want to convert it into a string; that will be (too) long.

like image 624
CuriousMind Avatar asked Aug 17 '11 22:08

CuriousMind


2 Answers

The following code should work:

[x for x in xrange(len(a)) if a[x:x+len(b)] == b]

Returns the index at which the pattern starts.

like image 99
danem Avatar answered Oct 13 '22 10:10

danem


I'm assuming you're looking for a numpy-specific solution, rather than a simple list comprehension or for loop. One straightforward approach is to use the rolling window technique to search for windows of the appropriate size.

This approach is simple, works correctly, and is much faster than any pure Python solution. It should be sufficient for many use cases. However, it is not the most efficient approach possible, for a number of reasons. For an approach that is more complicated, but asymptotically optimal in the expected case, see the numba-based rolling hash implementation in norok2's answer.

Here's the rolling_window function:

>>> def rolling_window(a, size):
...     shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
...     strides = a.strides + (a. strides[-1],)
...     return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
... 

Then you could do something like

>>> a = numpy.arange(10)
>>> numpy.random.shuffle(a)
>>> a
array([7, 3, 6, 8, 4, 0, 9, 2, 1, 5])
>>> rolling_window(a, 3) == [8, 4, 0]
array([[False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

To make this really useful, you'd have to reduce it along axis 1 using all:

>>> numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
array([False, False, False,  True, False, False, False, False], dtype=bool)

Then you could use that however you'd use a boolean array. A simple way to get the index out:

>>> bool_indices = numpy.all(rolling_window(a, 3) == [8, 4, 0], axis=1)
>>> numpy.mgrid[0:len(bool_indices)][bool_indices]
array([3])

For lists you could adapt one of these rolling window iterators to use a similar approach.

For very large arrays and subarrays, you could save memory like this:

>>> windows = rolling_window(a, 3)
>>> sub = [8, 4, 0]
>>> hits = numpy.ones((len(a) - len(sub) + 1,), dtype=bool)
>>> for i, x in enumerate(sub):
...     hits &= numpy.in1d(windows[:,i], [x])
... 
>>> hits
array([False, False, False,  True, False, False, False, False], dtype=bool)
>>> hits.nonzero()
(array([3]),)

On the other hand, this will probably be somewhat slower.

like image 29
senderle Avatar answered Oct 13 '22 09:10

senderle