
Unicode elementwise string comparison in numpy

I had a question about equality comparison with numpy and arrays of strings. Say I define the following array:

import numpy as np

x = np.array(['yes', 'no', 'maybe'])

Then I can test for equality against another string, and numpy does an element-wise comparison with that single string (following, I think, the broadcasting rules described here: http://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html):

'yes' == x
#op : array([ True, False, False], dtype=bool)

x == 'yes'
#op : array([ True, False, False], dtype=bool)

However, if I compare with unicode strings I get different behaviour: the element-wise comparison only happens if I compare the array to the string, and only a single comparison is made if I compare the string to the array.

x == u'yes'
#op : array([ True, False, False], dtype=bool)

u'yes' == x
#op : False

I can't find details of this behaviour in the numpy docs and was hoping someone could explain or point me to details of why comparison with unicode strings behaves differently?

jay--bee asked Jan 29 '16 13:01




1 Answer

The relevant piece of information is this part of Python's coercion rules:

For objects x and y, first x.__op__(y) is tried. If this is not implemented or returns NotImplemented, y.__rop__(x) is tried.
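
As a minimal sketch of that rule, the two classes below (Left and Right are hypothetical names used only for illustration, not part of Python or numpy) record the order in which their __eq__ methods are called. For ==, the reflected method is __eq__ on the right-hand operand.

```python
calls = []  # records which __eq__ methods ran, in order


class Left:
    """Hypothetical left operand that declines every comparison."""

    def __eq__(self, other):
        calls.append("Left.__eq__")
        return NotImplemented  # tell Python to try the other operand


class Right:
    """Hypothetical right operand that accepts the comparison."""

    def __eq__(self, other):
        calls.append("Right.__eq__")
        return "handled by Right"


# Python tries Left.__eq__ first, then falls back to Right.__eq__.
result = Left() == Right()
print(calls)
print(result)
```

Note that == hands back whatever object the winning __eq__ returns, which is how a numpy array can return a whole boolean array from a comparison.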

Using your numpy array x, when the left-hand side is a str ('yes' == x):

  • 'yes'.__eq__(x) returns NotImplemented and
  • therefore resolves to x.__eq__('yes') – resulting in numpy's element-wise comparison.

However, when the left-hand side is a unicode (u'yes' == x):

  • u'yes'.__eq__(x) simply returns False.

The reason for the different __eq__ behaviours is that str.__eq__() simply returns NotImplemented if its argument is not a str type, whereas unicode.__eq__() first tries to convert its argument to a unicode, and only returns NotImplemented if that conversion fails. In this case, the numpy array is convertible to a unicode: u'yes' == x is essentially u'yes' == unicode(x).
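
The difference can be modelled with a rough sketch. All three classes here are hypothetical stand-ins (ElementwiseArray for the numpy array, StrictText for Python 2's str, CoercingText for Python 2's unicode); they only imitate the behaviour described above and are not how CPython or numpy implement it.

```python
class ElementwiseArray:
    """Hypothetical stand-in for a numpy array of strings."""

    def __init__(self, items):
        self.items = list(items)

    def __eq__(self, other):
        # Compare every element against the other operand's text.
        text = getattr(other, "value", other)
        return [item == text for item in self.items]

    def __str__(self):
        return str(self.items)


class StrictText:
    """Models str: declines anything that is not its own type."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, StrictText):
            return self.value == other.value
        return NotImplemented  # lets the array's __eq__ take over


class CoercingText:
    """Models unicode: converts the argument to text first."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        try:
            converted = str(other)  # assumption: stands in for unicode(x)
        except Exception:
            return NotImplemented
        return self.value == converted  # a single comparison


x = ElementwiseArray(['yes', 'no', 'maybe'])
strict_result = StrictText('yes') == x      # falls back to x.__eq__
coercing_result = CoercingText('yes') == x  # never reaches x.__eq__
reversed_result = x == CoercingText('yes')  # x.__eq__ runs first
print(strict_result, coercing_result, reversed_result)
```

In this model, putting the array on the left-hand side always gives the element-wise result, because the array's own __eq__ is tried first.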

一二三 answered Oct 21 '22 09:10