String character identity paradox

Question

I'm completely stuck on this:

>>> s = chr(8263)
>>> x = s[0]
>>> x is s[0]
False

How is this possible? Does this mean that accessing a string character by indexing creates a new instance of the same character? Let's experiment:

>>> L = [s[0] for _ in range(1000)]
>>> len(set(L))
1
>>> ids = map(id, L)
>>> len(set(ids))
1000
>>>

Yikes what a waste of bytes ;) Or does it mean that str.__getitem__ has a hidden feature? Can somebody explain?

But this is not the end of my surprise:

>>> s = chr(8263)
>>> t = s
>>> print(t is s, id(t) == id(s))
True True

This is clear: t is an alias for s, so they refer to the same object and their identities coincide. But then, how is the following possible:

>>> print(t[0] is s[0])
False

s and t are the same object, so how can their first characters differ?

But worse:

>>> print(id(t[0]) == id(s[0]))
True

t[0] and s[0] have not been garbage collected and are considered different objects by the is operator, yet they have the same id? Can somebody explain?

Alex Riley · Accepted Answer

There are two points to make here.

First, Python does indeed create a new string object on each __getitem__ call, but only if the character has an ordinal value of 256 or greater.

For example:

>>> s = chr(255)
>>> s[0] is s
True

>>> t = chr(257)
>>> t[0] is t
False

This is because, internally, the compiled getitem function checks the ordinal value of the single character and calls get_latin1_char if that value is less than 256. This allows single-character strings in the Latin-1 range to be shared. Otherwise, a new unicode object is created.
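To see where the sharing stops on your own interpreter, you can probe a few code points around the Latin-1 boundary. This is only a sketch: the helper name getitem_returns_same_object is made up for illustration, and the behaviour is a CPython implementation detail, not something the language guarantees. On the CPython 3 builds I'm aware of, the cutoff falls between 255 and 256.

```python
def getitem_returns_same_object(code_point):
    """Return True if indexing a one-character string yields that same object."""
    s = chr(code_point)
    return s[0] is s


# Probe code points on either side of the Latin-1 boundary (hypothetical sample).
for cp in (65, 255, 256, 257, 8263):
    print(cp, getitem_returns_same_object(cp))
```

Because this relies on an internal cache, code should never depend on it; compare strings with == instead of is.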

The second issue concerns garbage collection and shows that the interpreter can reuse memory addresses very quickly. When you write:

>>> s = t # = chr(257)
>>> t[0] is s[0]
False

Python first creates two new single-character strings and then compares their memory addresses. Both objects are alive at the same time while the comparison runs, so they must have different addresses (they are different objects, as per the explanation above), and comparing them with is returns False.
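The same thing can be made visible by binding both temporaries to names, so that neither can be freed before the comparison. A minimal sketch, assuming CPython and a character outside the Latin-1 range (the names a and b are only for illustration):

```python
s = chr(8263)
t = s  # t is an alias for the same string object

# Bind both results to names so that both objects stay alive.
a = t[0]
b = s[0]

print(a == b)          # equal by value
print(a is b)          # but two distinct objects
print(id(a) == id(b))  # so their ids differ while both are alive
```

As long as both objects exist at the same moment, their ids cannot coincide.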

On the other hand, we can have the seemingly paradoxical situation that:

>>> id(t[0]) == id(s[0])
True

But this is because the interpreter quickly reuses the memory address of t[0] when it creates the new string s[0] at a later moment in time.

If you examine the bytecode this line produces (e.g. with dis - see below), you see that the addresses for each side are allocated one after the other (a new string object is created and then id is called on it).

The references to the object t[0] drop to zero as soon as id(t[0]) is returned (we're doing the comparison on integers now, not the object itself). This means that s[0] can reuse the same memory address when it is created afterwards.
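The sequence of events can be spelled out step by step. This is a sketch of what happens, not a guarantee: whether the address is actually reused depends on the allocator, so the final result is often True on CPython but may be False.

```python
s = chr(8263)

first = s[0]          # a new one-character string is allocated
first_id = id(first)  # from here on we only hold an integer
del first             # last reference gone, so CPython frees the object at once

second = s[0]         # a fresh allocation, possibly at the freed address
reused = id(second) == first_id
print(reused)         # often True on CPython, but not guaranteed
```

An id is only unique among objects that exist at the same time, which is why comparing ids of short-lived temporaries says nothing about identity.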

Here is the disassembled bytecode for the line id(t[0]) == id(s[0]) which I've annotated.

You can see that the lifetime of t[0] ends before s[0] is created (there are no references to it) hence its memory can be reused.

  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_GLOBAL              1 (t)
              6 LOAD_CONST               1 (0)
              9 BINARY_SUBSCR                     # t[0] is created
             10 CALL_FUNCTION            1        # id(t[0]) is computed...
                                                  # ...lifetime of string t[0] over
             13 LOAD_GLOBAL              0 (id)
             16 LOAD_GLOBAL              2 (s)
             19 LOAD_CONST               1 (0)
             22 BINARY_SUBSCR                     # s[0] is created...
                                                  # ...free to reuse t[0] memory
             23 CALL_FUNCTION            1        # id(s[0]) is computed
             26 COMPARE_OP               2 (==)   # the two ids are compared
             29 RETURN_VALUE
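If you want to reproduce a listing like this yourself, the dis module can compile the expression straight from a string. The opcode names and offsets vary between Python versions, so expect your output to look somewhat different from the listing above; the variable names EXPR and opnames are just for this sketch.

```python
import dis

EXPR = "id(t[0]) == id(s[0])"

# dis accepts source text directly; the expression is compiled but never run,
# so t and s do not need to exist.
dis.dis(EXPR)

# The same instructions as a list of opcode names, handy for comparing versions.
opnames = [ins.opname for ins in dis.get_instructions(EXPR)]
print(opnames)
```

Whatever the version, the order is the same: the left-hand id call completes before the right-hand subscript is evaluated.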

Tags:

python

string

python-internals

P. Ortiz

1 Answers

Alex Riley
