I'm completely stuck with this:
>>> s = chr(8263)
>>> x = s[0]
>>> x is s[0]
False
How is this possible? Does this mean that accessing a string character by indexing creates a new instance of the same character? Let's experiment:
>>> L = [s[0] for _ in range(1000)]
>>> len(set(L))
1
>>> ids = map(id, L)
>>> len(set(ids))
1000
>>>
Yikes, what a waste of bytes ;) Or does it mean that str.__getitem__ has a hidden feature? Can somebody explain?
But this is not the end of my surprise:
>>> s = chr(8263)
>>> t = s
>>> print(t is s, id(t) == id(s))
True True
This is clear: t is an alias for s, so they refer to the same object and their identities coincide. But then, how is the following possible?
>>> print(t[0] is s[0])
False
s and t are the same object, so what is going on?
But worse:
>>> print(id(t[0]) == id(s[0]))
True
t[0] and s[0] have not been garbage collected, the is operator considers them different objects, and yet they have the same id? Can somebody explain?
There are two points to make here.
First, Python does indeed create a new character object in the __getitem__ call, but only if that character has an ordinal value greater than 255.
For example:
>>> s = chr(255)
>>> s[0] is s
True
>>> t = chr(257)
>>> t[0] is t
False
This is because internally, CPython's compiled getitem function checks the ordinal value of the single character and calls get_latin1_char if that value is less than 256 (that is, within the Latin-1 range). This allows those single-character strings to be shared. Otherwise, a new unicode object is created.
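If you want to check where the sharing stops on your own interpreter, here is a minimal sketch. The helper name index_returns_shared_object is mine, not something from CPython, and the whole thing relies on a CPython implementation detail, so other implementations (or future versions) may behave differently.

```python
def index_returns_shared_object(code_point):
    """Return True if indexing the same string twice yields the very same object."""
    s = chr(code_point)
    # Keep both results bound to names so neither can be freed and recycled
    # before the identity check runs.
    first = s[0]
    second = s[0]
    return first is second


# A few sample code points, including the one from the question (8263).
for cp in (65, 255, 256, 8263):
    print(cp, index_returns_shared_object(cp))

# Scan upward for the first code point whose indexed character is not shared.
first_unshared = next(
    cp for cp in range(0x110000) if not index_returns_shared_object(cp)
)
print("first code point that is not shared:", first_unshared)
```

Binding both results to names before comparing matters: it rules out the address-reuse effect, so a True here really means one shared object.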
The second issue concerns garbage collection and shows that the interpreter can reuse memory addresses very quickly. When you write:
>>> s = t # = chr(257)
>>> t[0] is s[0]
False
Python first creates two new single-character strings and then compares their memory addresses. Both strings are alive at the same time while the comparison runs, so they must have different addresses (they are different objects, as explained above), and comparing the objects with is returns False.
On the other hand, we can have the seemingly paradoxical situation that:
>>> id(t[0]) == id(s[0])
True
But this is because the interpreter quickly reuses the memory address of t[0] when it creates the new string s[0] a moment later.
If you examine the bytecode this line produces (e.g. with dis - see below), you see that the objects for each side are allocated one after the other (a new string object is created and then id is called on it).
The reference count of the object t[0] drops to zero as soon as id(t[0]) returns (we're doing the comparison on integers now, not on the object itself). This means that s[0] can reuse the same memory address when it is created afterwards.
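To see the difference between the two situations side by side, here is a small sketch. The names a and b are just illustrative. The first two results are guaranteed by the language, because two objects that are alive at the same time cannot share an id. The last result depends on the allocator, so treat it as likely on CPython rather than guaranteed.

```python
s = chr(8263)

# Case 1: both temporaries are kept alive by names, so they must occupy
# different memory and therefore have different ids.
a = s[0]
b = s[0]
print(a is b)
print(id(a) == id(b))

# Case 2: each temporary dies as soon as id() returns its integer, so the
# second string may be allocated at the address the first one just vacated.
print(id(s[0]) == id(s[0]))
```

The two id values in the last line were never valid at the same time, so their being equal tells you nothing about object identity.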
Here is the disassembled bytecode for the line id(t[0]) == id(s[0]), which I've annotated.
You can see that the lifetime of t[0] ends before s[0] is created (there are no references to it), hence its memory can be reused.
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_GLOBAL              1 (t)
              6 LOAD_CONST               1 (0)
              9 BINARY_SUBSCR                       # t[0] is created
             10 CALL_FUNCTION            1          # id(t[0]) is computed...
                                                    # ...lifetime of string t[0] over
             13 LOAD_GLOBAL              0 (id)
             16 LOAD_GLOBAL              2 (s)
             19 LOAD_CONST               1 (0)
             22 BINARY_SUBSCR                       # s[0] is created...
                                                    # ...free to reuse t[0] memory
             23 CALL_FUNCTION            1          # id(s[0]) is computed
             26 COMPARE_OP               2 (==)     # the two ids are compared
             29 RETURN_VALUE
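If you would like to produce a listing like this yourself, something along these lines should work. It is only a sketch: I compile the expression on its own in "eval" mode (the "<example>" filename is arbitrary), so you will see LOAD_NAME rather than LOAD_GLOBAL, and the opcode names and offsets differ between Python versions.

```python
import dis

s = t = chr(257)

# Compile just the expression so that only its bytecode is listed.
code = compile("id(t[0]) == id(s[0])", "<example>", "eval")
dis.dis(code)

# Collecting the opcode names makes the left-to-right order easy to scan:
# the whole left-hand id(...) call finishes before the right-hand side starts.
opnames = [instruction.opname for instruction in dis.get_instructions(code)]
print(opnames)
```

Whatever the exact opcodes are on your version, the point to look for is the same: the call that computes id(t[0]) completes before the subscript that creates s[0] is executed.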