
Finding entries containing a substring in a numpy array?


I tried to find the entries of an array that contain a substring, using np.where with an in condition:

    import numpy as np
    foo = "aa"
    bar = np.array(["aaa", "aab", "aca"])
    np.where(foo in bar)

This only returns an empty array.
Why is that?
And is there a good alternative solution?

SiOx asked Aug 16 '16 11:08




2 Answers

We can use np.core.defchararray.find to find the position of the string foo in each element of bar; it returns -1 where foo is not found. Checking the output of find against -1 therefore tells us, element by element, whether foo is present. Finally, np.flatnonzero gives the indices of the matches. So, we would have an implementation like so -

np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1) 

Sample run -

    In [91]: bar
    Out[91]:
    array(['aaa', 'aab', 'aca'],
          dtype='|S3')

    In [92]: foo
    Out[92]: 'aa'

    In [93]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
    Out[93]: array([0, 1])

    In [94]: bar[2] = 'jaa'

    In [95]: np.flatnonzero(np.core.defchararray.find(bar,foo)!=-1)
    Out[95]: array([0, 1, 2])
Divakar answered Sep 19 '22 16:09


Look at some examples of using in:

    In [19]: bar = np.array(["aaa", "aab", "aca"])

    In [20]: 'aa' in bar
    Out[20]: False

    In [21]: 'aaa' in bar
    Out[21]: True

    In [22]: 'aab' in bar
    Out[22]: True

    In [23]: 'aab' in list(bar)

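It looks like in, when used with an array, works as though the array were a list. ndarray does have a __contains__ method, so in works, but the test it performs is probably simple: as the examples show, it matches whole elements and does not look inside them.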

But in any case, note that in applied to a list does not check for substrings. The string's own __contains__ does the substring test, but I don't know of any builtin class that propagates the test down to the component strings.
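To make the difference concrete, here is a minimal sketch that contrasts the two tests. It reuses foo and bar from the question; the names whole_match, mask and indices are only illustrative. The per-element substring test is written as a plain Python list comprehension, which is one possible approach rather than the only one.

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# `in` on the array asks whether any whole element equals foo,
# so np.where(foo in bar) only ever sees this single boolean.
whole_match = foo in bar  # False: no element is exactly "aa"

# The substring test has to be applied to each string separately.
mask = np.array([foo in s for s in bar])  # [True, True, False]
indices = np.flatnonzero(mask)            # array([0, 1])
```

Because np.where is handed one boolean instead of a per-element mask, it has nothing to index into, which matches the empty result reported in the question.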

As Divakar shows, there is a collection of numpy functions that apply string methods to the individual elements of an array.

    In [42]: np.char.find(bar, 'aa')
    Out[42]: array([ 0,  0, -1])

Docstring:
This module contains a set of functions for vectorized string operations and methods. The preferred alias for defchararray is numpy.char.
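A short sketch of how the find output could be turned into indices or into the matching entries themselves, using the np.char alias that the docstring names as preferred. The variable names positions, mask, indices and matches are just for illustration.

```python
import numpy as np

foo = "aa"
bar = np.array(["aaa", "aab", "aca"])

# Position of foo in each element, -1 where it does not occur.
positions = np.char.find(bar, foo)  # array([ 0,  0, -1])

# Boolean mask of the elements that contain foo.
mask = positions != -1

indices = np.flatnonzero(mask)  # array([0, 1])
matches = bar[mask]             # array(['aaa', 'aab'], ...)
```

The boolean mask is usually the most convenient intermediate: it can be passed to np.flatnonzero for positions or used directly to index bar.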

For operations like this, I think the np.char speeds are about the same as with:

    In [49]: np.frompyfunc(lambda x: x.find('aa'), 1, 1)(bar)
    Out[49]: array([0, 0, -1], dtype=object)

    In [50]: np.frompyfunc(lambda x: 'aa' in x, 1, 1)(bar)
    Out[50]: array([True, True, False], dtype=object)
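One practical detail of the frompyfunc route is that the result has dtype=object, as the output above shows. The sketch below assumes you want to use that result as a mask, in which case casting it to bool first is a reasonable step. The name contains_aa is hypothetical.

```python
import numpy as np

bar = np.array(["aaa", "aab", "aca"])

# Wrap the plain Python substring test so it is applied per element.
contains_aa = np.frompyfunc(lambda s: "aa" in s, 1, 1)

obj_mask = contains_aa(bar)   # dtype=object, as in Out[50]
mask = obj_mask.astype(bool)  # real boolean mask, usable for indexing

matches = bar[mask]           # array(['aaa', 'aab'], ...)
```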

Further tests suggest that the ndarray __contains__ operates on the flat version of the array - that is, shape doesn't affect its behavior.
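A small check along those lines, as a sketch with a made-up 2-D array bar2d: the in test gives the same answer on the array and on its flattened version, and it still matches whole elements only. The per-element substring mask from np.char.find, by contrast, keeps the array's shape.

```python
import numpy as np

# Hypothetical 2-D example, not taken from the original session.
bar2d = np.array([["aaa", "aab"], ["aca", "jaa"]])

found_whole = "jaa" in bar2d  # True: some element equals "jaa"
found_sub = "aa" in bar2d     # False: no element equals "aa"

# Same answer on the flattened array, so shape plays no role here.
same_as_flat = found_whole == ("jaa" in bar2d.ravel())

# The per-element substring test preserves the 2-D shape.
mask = np.char.find(bar2d, "aa") != -1
```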

hpaulj answered Sep 17 '22 16:09