Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does the "is" keyword have a different behavior when there is a dot in the string?

Consider this code:

>>> x = "google" >>> x is "google" True >>> x = "google.com" >>> x is "google.com" False >>> 

Why is it like that?

To make sure the above is correct, I have just tested on Python 2.5.4, 2.6.5, 2.7b2, Python 3.1 on windows and Python 2.7b1 on Linux.

It looks like there is consistency across all of them, so it's by design. Am I missing something?

I just find it out that from some of my personal domain filtering script failing with that.

like image 431
YOU Avatar asked May 18 '10 15:05

YOU


1 Answers

is verifies object identity, and any implementation of Python, when it meets literal of immutable types, is perfectly free to either make a new object of that immutable type, or seek through existing objects of that type to see if some of them could be reused (by adding a new reference to the same underlying object). This is a pragmatic choice of optimization and not subject to semantic constraints, so your code should never rely on which path a give implementation may take (or it could break with a bugfix/optimization release of Python!).

Consider for example:

>>> import dis >>> def f(): ...   x = 'google.com' ...   return x is 'google.com' ...  >>> dis.dis(f)   2           0 LOAD_CONST               1 ('google.com')               3 STORE_FAST               0 (x)    3           6 LOAD_FAST                0 (x)               9 LOAD_CONST               1 ('google.com')              12 COMPARE_OP               8 (is)              15 RETURN_VALUE     

so in this particular implementation, within a function, your observation does not apply and only one object is made for the literal (any literal), and, indeed:

>>> f() True 

Pragmatically that's because within a function making a pass through the local table of constants (to save some memory by not making multiple constant immutable objects where one suffices) is pretty cheap and fast, and may offer good performance returns since the function may be called repeatedly afterwards.

But, the very same implementation, at the interactive prompt (Edit: I originally thought this would also happen at a module's top level, but a comment by @Thomas set me right, see later):

>>> x = 'google.com' >>> y = 'google.com' >>> id(x), id(y) (4213000, 4290864) 

does NOT bother trying to save memory that way -- the ids are different, i.e., distinct objects. There are potentially higher costs and lower returns and so the heuristics of this implementation's optimizer tell it to not bother searching and just go ahead.

Edit: at module top level, per @Thomas' observation, given e.g.:

$ cat aaa.py x = 'google.com' y = 'google.com' print id(x), id(y) 

again we see the table-of-constants-based memory-optimization in this implementation:

>>> import aaa 4291104 4291104 

(end of Edit per @Thomas' observation).

Lastly, again on the same implementation:

>>> x = 'google' >>> y = 'google' >>> id(x), id(y) (2484672, 2484672) 

the heuristics are different here because the literal string "looks like it might be an identifier" -- so it might be used in operation requiring interning... so the optimizer interns it anyway (and once interned, looking for it becomes very fast of course). And indeed, surprise surprise...:

>>> z = intern(x) >>> id(z) 2484672 

...x has been interned the very first time (as you see, the return value of intern is the same object as x and y, as it has the same id()). Of course, you shouldn't rely on this either -- the optimizer doesn't have to intern anything automatically, it's just an optimization heuristic; if you need interned string, intern them explicitly, just to be safe. When you do intern strings explicitly...:

>>> x = intern('google.com') >>> y = intern('google.com') >>> id(x), id(y) (4213000, 4213000) 

...then you do ensure exactly the same object (i.e., same id()) results each and every time -- so you can apply micro-optimizations such as checking with is rather than == (I've hardly ever found the miniscule performance gain to be worth the bother;-).

Edit: just to clarify, here are the kind of performance differences I'm talking about, on a slow Macbook Air...:

$ python -mtimeit -s"a='google';b='google'" 'a==b' 10000000 loops, best of 3: 0.132 usec per loop $ python -mtimeit -s"a='google';b='google'" 'a is b' 10000000 loops, best of 3: 0.107 usec per loop $ python -mtimeit -s"a='goo.gle';b='goo.gle'" 'a==b' 10000000 loops, best of 3: 0.132 usec per loop $ python -mtimeit -s"a='google';b='google'" 'a is b' 10000000 loops, best of 3: 0.106 usec per loop $ python -mtimeit -s"a=intern('goo.gle');b=intern('goo.gle')" 'a is b' 10000000 loops, best of 3: 0.0966 usec per loop $ python -mtimeit -s"a=intern('goo.gle');b=intern('goo.gle')" 'a == b' 10000000 loops, best of 3: 0.126 usec per loop 

...a few tens of nanoseconds either way, at most. So, worth even thinking about only in the most extreme "optimize the [expletive deleted] out of this [expletive deleted] performance bottleneck" situations!-)

like image 156
Alex Martelli Avatar answered Oct 23 '22 06:10

Alex Martelli