Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Python determine if two strings are identical

I've tried to understand when Python strings are identical (aka sharing the same memory location). However during my tests, there seems to be no obvious explanation when two string variables that are equal share the same memory:

import sys
print(sys.version) # 3.4.3

# Example 1
s1 = "Hello"
s2 = "Hello"
print(id(s1) == id(s2)) # True

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

# Example 4
s1 = "HelloHelloHelloHelloHello"
s2 = "HelloHelloHelloHelloHello"
print(id(s1) == id(s2)) # True

# Example 5
s1 = "Hello" * 5
s2 = "Hello" * 5
print(id(s1) == id(s2)) # False

Strings are immutable, and as far as I know Python tries to re-use existing immutable objects, by having other variables point to them instead of creating new objects in memory with the same value.

With this in mind, it seems obvious that Example 1 returns True.
It's still obvious (to me) that Example 2 returns True.

It's not obvious to me, that Example 3 returns False - am I not doing the same thing as in Example 2?!?

I stumbled upon this SO question:
Why does comparing strings in Python using either '==' or 'is' sometimes produce a different result?

and read through http://guilload.com/python-string-interning/ (though I probably didn't understand it all) and thougt - okay, maybe "interned" strings depend on the length, so I used HelloHelloHelloHelloHello in Example 4. The result was True.

And what the puzzled me, was doing the same as in Example 2 just with a bigger number (but it would effectively return the same string as Example 4) - however this time the result was False?!?

I have really no idea how Python decides whether or not to use the same memory object, or when to create a new one.

Are the any official sources that can explain this behavior?

like image 711
Mike Scotty Avatar asked Mar 22 '18 11:03

Mike Scotty


2 Answers

From the link you posted:

Avoiding large .pyc files

So why does 'a' * 21 is 'aaaaaaaaaaaaaaaaaaaaa' not evaluate to True? Do you remember the .pyc files you encounter in all your packages? Well, Python bytecode is stored in these files. What would happen if someone wrote something like this ['foo!'] * 10**9? The resulting .pyc file would be huge! In order to avoid this phenomena, sequences generated through peephole optimization are discarded if their length is superior to 20.

If you have the string "HelloHelloHelloHelloHello", Python will necessarily have to store it as it is (asking the interpreter to detect repeating patterns in a string to save space might be too much). However, when it comes to string values that can be computed at parsing time, such as "Hello" * 5, Python evaluate those as part of this so-called "peephole optimization", which can decide whether it is worth it or not to precompute the string. Since len("Hello" * 5) > 20, the interpreter leaves it as it is to avoid storing too many long strings.

EDIT:

As indicated in this question, you can check this on the source code in peephole.c, function fold_binops_on_constants, near the end you will see:

// ...
} else if (size > 20) {
    Py_DECREF(newconst);
    return -1;
}

EDIT 2:

Actually, that optimization code has recently been moved to the AST optimizer for Python 3.7, so now you would have to look into ast_opt.c, function fold_binop, which calls now function safe_multiply, which checks that the string is no longer than MAX_STR_SIZE, newly defined as 4096. So it seems that the limit has been significantly bumped up for the next releases.

like image 126
jdehesa Avatar answered Oct 31 '22 08:10

jdehesa


In Example 2 :

# Example 2
s1 = "Hello" * 3
s2 = "Hello" * 3
print(id(s1) == id(s2)) # True

Here the value of s1 and s2 are evaluated at compile time.so this will return true.

In Example 3 :

# Example 3
i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(s1) == id(s2)) # False

Here the values of s1 and s2 are evaluated at runtime and the result is not interned automatically so this will return false.This is to avoid excess memory allocation by creating "HelloHelloHello" string at runtime itself.

If you do intern manually it will return True

i = 3
s1 = "Hello" * i
s2 = "Hello" * i
print(id(intern(s1)) == id(intern(s2))) # True
like image 27
SaiPawan Avatar answered Oct 31 '22 09:10

SaiPawan