From the tweet here:
import sys

x = 'ñ'
print(sys.getsizeof(x))
int(x)  # throws an error
print(sys.getsizeof(x))
We get 74, then 77 bytes for the two getsizeof
calls.
It looks like we are adding 3 bytes to the object, from the failed int call.
Some more examples from Twitter (you may need to restart Python to reset the size back to 74):
x = 'ñ'
y = 'ñ'
int(x)
print(sys.getsizeof(y))
77!
print(sys.getsizeof('ñ'))
int('ñ')
print(sys.getsizeof('ñ'))
74, then 77.
To convert, or cast, a string to an integer in Python, you use the built-in int() function. It takes the string you want to convert as a parameter and returns the integer equivalent of the value you passed. The general syntax looks something like this: int("str").
Python strings are "immutable", which means they cannot be changed after they are created (Java strings are immutable in the same way). Since strings can't be changed, we construct *new* strings as we go to represent computed values.
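A quick illustration of that (a minimal sketch; the names s and t are just for demonstration):

s = 'abc'
t = s.upper()   # .upper() can't modify s; it builds and returns a new string
print(s)        # 'abc' -- the original is untouched
print(t)        # 'ABC'
print(s is t)   # False -- two distinct string objects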
The code that converts strings to ints in CPython 3.6 requests a UTF-8 form of the string to work with:
buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);
and the string creates the UTF-8 representation the first time it's requested and caches it on the string object:
if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
           PyBytes_AS_STRING(bytes),
           _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}
The extra 3 bytes are for the UTF-8 representation.
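That count lines up with a quick check from Python (a sketch; the extra byte beyond the encoded length comes from the PyObject_MALLOC(... + 1) call above, which leaves room for a trailing NUL):

utf8 = 'ñ'.encode('utf-8')
print(len(utf8))      # 2 -- 'ñ' takes two bytes in UTF-8
print(len(utf8) + 1)  # 3 -- plus the NUL terminator the cache allocates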
You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string is in "compact ASCII" representation, Python doesn't create a separate UTF-8 representation. It returns the ASCII representation directly, which is already valid UTF-8:
#define PyUnicode_UTF8(op)                      \
    (assert(_PyUnicode_CHECK(op)),              \
     assert(PyUnicode_IS_READY(op)),            \
     PyUnicode_IS_COMPACT_ASCII(op) ?           \
         ((char*)((PyASCIIObject*)(op) + 1)) :  \
         _PyUnicode_UTF8(op))
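You can see that from Python (a sketch; the exact sizes depend on the build and Python version, but the point is that they match before and after):

import sys

s = '40'
print(sys.getsizeof(s))   # e.g. 51 on a 64-bit CPython 3.6 build
int(s)                    # succeeds, and no separate UTF-8 copy is cached
print(sys.getsizeof(s))   # same number as before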
You also might wonder why the size doesn't change for something like '１'. That's U+FF11 FULLWIDTH DIGIT ONE, which int treats as equivalent to '1'. That's because one of the earlier steps in the string-to-int process is
asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);
which converts all whitespace characters to ' '
and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it doesn't end up changing anything, but when it does make changes, it creates a new string, and the new string is the one that gets a UTF-8 representation created.
As for the cases where calling int on one string looks like it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because this is a single-character string in the Latin-1 range ('\x00'-'\xff'), and the implementation stores and reuses those.
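Here's one way to see that reuse (a sketch; this is a CPython implementation detail, not something the language guarantees):

import sys

x = 'ñ'
y = chr(0xf1)             # another spelling of the same one-character string
print(x is y)             # True on CPython -- single Latin-1 characters are cached and reused

try:
    int(x)                # fails, but first caches a UTF-8 copy on the shared object
except ValueError:
    pass

print(sys.getsizeof(y))   # y reports the larger size too -- it is the same object as x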