
Strange behavior when comparing unicode objects with string objects

Comparing two strings in Python works fine, and comparing a string object with a unicode object fails, as expected. However, comparing a string object with a unicode object that has been converted to a string (unicode --> str) also fails.

A Demo:

Works as expected:

>>> if 's' is 's': print "Hurrah!"
... 
Hurrah!

Pretty much yeah:

>>> if 's' is u's': print "Hurrah!"
... 

Not expected:

>>> if 's' is str(u's'): print "Hurrah!"
... 

Why doesn't the third example work as expected, when both objects are of the same type?

>>> type('s')
<type 'str'>

>>> type(str(u's'))
<type 'str'>

K DawG asked Dec 07 '13 06:12


3 Answers

Don't use is for this, use ==. With is, you're checking whether the objects have the same identity, not whether they are equal. Of course, if they are the same object, they will be equal (==), but if they are equal, they aren't necessarily the same object.
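To make the identity/equality distinction concrete, here is a minimal sketch. It uses lists rather than strings on purpose, because a list display always builds a new object, so the result does not depend on any interning the interpreter may do. The names a, b and c are just for illustration.

```python
# Two separate list objects that happen to hold the same values.
a = [1, 2, 3]
b = [1, 2, 3]
c = a  # a second name bound to the same object as a

print(a == b)  # True: the values are equal
print(a is b)  # False: they are two distinct objects
print(a is c)  # True: one object, two names
```

The same rule applies to strings; the only difference is that the interpreter is free to reuse string objects, which is what makes the string case look inconsistent.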

The fact that the first one works is an implementation detail of CPython. Small strings, since they're immutable, can be interned by the interpreter. Every time you put the string "s" in your source code, CPython reuses the same object. However, str(u"s") apparently returns a new string with the same value. This isn't all that surprising.


You might be asking yourself, "why intern the string 's' at all?". That's a reasonable question. After all, it's a short string -- how much memory could multiple copies floating around in your source take? The answer (I think) is dictionary lookups. Since dicts with strings as keys are so common in Python, you can speed up the hash function/equality checking of keys by doing lightning fast pointer comparisons (falling back on the slower strcmp when the pointer comparison returns false).
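If you want to see interning directly, you can request it explicitly. This is only a sketch: intern() is a builtin in Python 2 and lives at sys.intern in Python 3, so the snippet tries both. The strings are assembled at runtime with join so that the compiler cannot merge them, and the variable names are made up for the example.

```python
try:
    intern  # Python 2: intern() is a builtin
except NameError:
    from sys import intern  # Python 3: it moved to sys.intern

# Build two equal strings at runtime.
first = "".join(["dictionary", "_key"])
second = "".join(["dictionary", "_key"])

print(first == second)  # True: same characters

# intern() returns the one canonical object for a given string value.
canon_first = intern(first)
canon_second = intern(second)
print(canon_first is canon_second)  # True: both names point at the canonical copy
```

Whether first is second holds before interning is up to the implementation, which is exactly why code should not rely on is for string comparison.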

mgilson answered Nov 18 '22 07:11


The is operator compares the memory locations of its two operands. Because strings are immutable, CPython can safely reuse a single object for both 's' literals, so they occupy the same location in memory.

As @mgilson points out, 's' and u's' are of different types, and therefore don't occupy the same memory location, which is why 's' is u's' evaluates to False.

However, when you call str(u's'), a new string is created and returned. This new string, because it is created anew, lives in a new location in memory, which is why the is comparison fails.

What you really want is to check that they are equivalent strings, so use ==:

In [1]: 's' == u's'
Out[1]: True

In [2]: 's' == 's'
Out[2]: True

In [3]: 's' == str(u's')
Out[3]: True

inspectorG4dget answered Nov 18 '22 07:11


Use == for value comparison and is for reference comparison. is evaluates to True only if both objects have the same id. str() can return a new object with a different id, so in that case you get False.
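A rough way to check this yourself is to compare id() values directly. The helper same_object below is hypothetical (it is not a builtin), and the strings are built at runtime so nothing depends on how literals are stored.

```python
def same_object(x, y):
    """Return True when x and y are the same object, i.e. what `is` tests."""
    return id(x) == id(y)

original = "".join(["uni", "code"])
alias = original                    # same object, second name
rebuilt = "".join(["uni", "code"])  # equal value, built separately

print(same_object(original, alias))  # True: an alias shares the id
print(rebuilt == original)           # True: the values match

# For two live objects, `is` and the id comparison always agree.
print(same_object(rebuilt, original) == (rebuilt is original))  # True
```

No expected result is given for rebuilt is original on its own, since that depends on whether the interpreter decides to reuse the object.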

Steve P. answered Nov 18 '22 06:11