
Python: find a string pattern in a NumPy array of strings

I have a NumPy array of strings, A, of length 100. The elements are sentences of different lengths, and they are plain Python strings, NOT NumPy strings:

>>> type(A[0])
<type 'str'>

I want to find the positions of the strings in A that contain a certain pattern, such as 'zzz'.

I tried

np.core.defchararray.find(A, 'zzz')

which gives the error:

TypeError: string operation on non-string array

I assume I will need to convert each 'str' in A to a NumPy string?

Edit:

I want to find the indices of the elements of A in which 'zzz' appears.

Zanam asked Jun 02 '16 18:06


3 Answers

No need to be fancy with this. You can get the list of indices with a list comprehension and the in operator:

>>> import numpy as np
>>> lst = ["aaa","aazzz","zzz"]
>>> n = np.array(lst)
>>> [i for i,item in enumerate(n) if "zzz" in item]
[1, 2]

Note that here the elements of the array are actually numpy strings, but the in operator will work for regular strings too, so it's moot.
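
To check the point about regular strings, here is a minimal sketch that runs the same comprehension on an object array of plain Python str, which is the situation described in the question. The helper name indices_containing and the sample data are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

def indices_containing(arr, pattern):
    """Return the positions of the elements of arr that contain pattern."""
    # The `in` operator works on both str and numpy.str_ elements.
    return [i for i, item in enumerate(arr) if pattern in item]

# Object array holding plain Python strings, as in the question.
A = np.array(["aaa", "aazzz", "zzz"], dtype=object)
result = indices_containing(A, "zzz")
print(type(A[0]))  # <class 'str'>
print(result)      # [1, 2]
```

The result is a plain Python list. Wrap it in np.array(...) if an index array is needed for fancy indexing.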

wnnmaw answered Sep 19 '22 15:09


The issue here is the nature of your array of strings.

If I make the array like:

In [362]: x=np.array(['one','two','three'])

In [363]: x
Out[363]: 
array(['one', 'two', 'three'], 
      dtype='<U5')

In [364]: type(x[0])
Out[364]: numpy.str_

The elements are a special kind of string, implicitly padded to 5 characters (the length of the longest, 'three'). The np.char methods work on this kind of array:

In [365]: np.char.find(x,'one')
Out[365]: array([ 0, -1, -1])

But if I make an object array that contains strings, it produces your error:

In [366]: y=np.array(['one','two','three'],dtype=object)

In [367]: y
Out[367]: array(['one', 'two', 'three'], dtype=object)

In [368]: type(y[0])
Out[368]: str

In [369]: np.char.find(y,'one')
...
/usr/lib/python3/dist-packages/numpy/core/defchararray.py in find(a, sub, start, end)
...
TypeError: string operation on non-string array

More often than not, an object array has to be treated like a list:

In [370]: y
Out[370]: array(['one', 'two', 'three'], dtype=object)

In [371]: [i.find('one') for i in y]
Out[371]: [0, -1, -1]

In [372]: np.array([i.find('one') for i in y])
Out[372]: array([ 0, -1, -1])

The np.char methods are convenient, but they aren't faster. They still have to iterate through the array applying regular string operations to each element.
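
A rough way to check this on your own machine is to time both approaches with timeit. The sketch below is only illustrative: the array size and repetition count are arbitrary assumptions, and the relative timings can differ between NumPy versions, so no figures are quoted here.

```python
import timeit
import numpy as np

words = ["one", "two", "three"] * 1000
x = np.array(words)                # fixed-width string array
y = np.array(words, dtype=object)  # object array of Python str

# Both approaches should give the same per-element offsets.
char_result = np.char.find(x, "one")
list_result = [s.find("one") for s in y]

# Time each approach; the absolute numbers depend on machine and NumPy version.
t_char = timeit.timeit(lambda: np.char.find(x, "one"), number=100)
t_list = timeit.timeit(lambda: [s.find("one") for s in y], number=100)
print(f"np.char.find: {t_char:.4f} s, list comprehension: {t_list:.4f} s")
```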

hpaulj answered Sep 19 '22 15:09


You can convert the array to a NumPy string array first:

np.core.defchararray.find(A.astype(str), 'zzz')
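
The call above returns, for each element, the offset of the first match, or -1 when there is no match. Since the question asks for the indices of the matching elements, one possible follow-up step is sketched below. The sample array is an assumed stand-in for the question's A, and np.char.find is used as the public alias of the same function.

```python
import numpy as np

# Stand-in for the question's object array of Python strings.
A = np.array(["aaa", "aazzz", "zzz"], dtype=object)

A_str = A.astype(str)                  # fixed-width Unicode copy of A
offsets = np.char.find(A_str, "zzz")   # offset of the match in each element, -1 if absent
idx = np.flatnonzero(offsets >= 0)     # indices of the elements that contain 'zzz'

print(A_str.dtype)  # <U5
print(idx)          # [1 2]
```

Note that astype(str) makes a copy in which every element is stored at the width of the longest string, which costs extra memory when the sentences vary a lot in length.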

Matias Thayer answered Sep 20 '22 15:09