Is there a simple way to find all elements in a NumPy array that match some pattern?
For example, consider the following array:
a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)
And I need to find all combinations which contain '**dd'.
I basically need a function, which receives the array as input and returns a smaller array with all relevant elements:
>> b = func(a, pattern='**dd')
>> b = array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)
Since it turns out you're actually working with pandas, there are simpler ways to do it at the Series level instead of just an ndarray, using the vectorized string operations:
In [32]: s = pd.Series(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
...: 'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
...: 'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
...: 'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'])
In [33]: s[s.str.endswith("dd")]
Out[33]:
2 zzdd
3 zddd
10 zndd
11 nddd
20 nndd
29 dddd
dtype: object
which produces a Series, or if you really insist on an ndarray:
In [34]: s[s.str.endswith("dd")].values
Out[34]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)
You can also use regular expressions, if you prefer:
In [49]: s[s.str.match(".*dd$")]
Out[49]:
2 zzdd
3 zddd
10 zndd
11 nddd
20 nndd
29 dddd
dtype: object
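If you want the func(a, pattern) signature from the question without going through pandas, here is a minimal sketch. It assumes the '*' in the question's pattern is meant as a shell-style wildcard, and it uses the standard library's fnmatch module, which is my choice rather than anything from the approaches above. The name func comes from the question; the shortened sample array is only for illustration.

```python
import fnmatch
import numpy as np

def func(a, pattern):
    """Return the elements of `a` that match a shell-style `pattern`."""
    # fnmatchcase is case-sensitive: '*' matches any run of characters
    # (including none) and '?' matches exactly one character.
    mask = np.array([fnmatch.fnmatchcase(s, pattern) for s in a], dtype=bool)
    return a[mask]

# Shortened version of the sample array from the question.
a = np.array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dnnd', 'nndd', 'dddd'],
             dtype=object)

b = func(a, pattern='**dd')  # anything ending in 'dd'
c = func(a, pattern='??dd')  # exactly four characters, ending in 'dd'
```

This loops in Python rather than using vectorized operations, so for large arrays the pandas string methods are likely the better fit. If each '*' is supposed to stand for exactly one character, use '?' in the pattern instead.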
Here's an approach using numpy.core.defchararray.rfind (also available as np.char.rfind) to get the last index of a match; we then check whether that index equals the length of each string minus 2. Every string here has length 4, so we look for a last index of 4 - 2 = 2.
Thus, an implementation would be -
a[np.core.defchararray.rfind(a.astype(str),'dd')==2]
If the strings are not of equal length, we need to get each length, subtract 2, and then compare -
len_sub = np.array(list(map(len,a)))-len('dd')
a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
To test this out, let's append a longer string ending with dd to the given sample -
In [121]: a = np.append(a,'ewqjejwqjedd')
In [122]: len_sub = np.array(list(map(len,a)))-len('dd')
In [123]: a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
Out[123]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd',
'ewqjejwqjedd'], dtype=object)
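As a possible simplification, NumPy also has a vectorized suffix test, np.char.endswith, which avoids the length bookkeeping. The sketch below uses a small made-up array (not the sample above) to compare it with the rfind comparison. It also shows one edge case worth knowing about: for a one-character string, rfind returns -1 and the length minus 2 is also -1, so the rfind comparison reports a match.

```python
import numpy as np

# Hypothetical test data, including a one-character string.
a = np.array(['zzdd', 'dnnd', 'd', 'ewqjejwqjedd'], dtype=object)
a_str = a.astype(str)  # the np.char functions expect a string dtype

# rfind-based comparison, as described above.
len_sub = np.array([len(s) for s in a]) - len('dd')
rfind_mask = np.char.rfind(a_str, 'dd') == len_sub

# Direct suffix test.
endswith_mask = np.char.endswith(a_str, 'dd')

rfind_result = a[rfind_mask]        # includes 'd' (false positive)
endswith_result = a[endswith_mask]  # only strings that really end in 'dd'
```

If one-character strings cannot occur in your data, both masks select the same elements.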