From the tweet here:
import sys

x = 'ñ'
print(sys.getsizeof(x))
int(x)  # throws an error
print(sys.getsizeof(x))
We get 74, then 77 bytes for the two getsizeof
calls.
It looks like we are adding 3 bytes to the object, from the failed int call.
Some more examples from Twitter (you may need to restart Python to reset the size back to 74):
x = 'ñ'
y = 'ñ'
int(x)
print(sys.getsizeof(y))
77!
print(sys.getsizeof('ñ'))
int('ñ')
print(sys.getsizeof('ñ'))
74, then 77.
To convert, or cast, a string to an integer in Python, you use the built-in int() function. It takes the string you want to convert as a parameter and returns the integer equivalent of the value you passed. The general syntax looks something like this: int("str").
Python strings are "immutable", which means they cannot be changed after they are created (Java strings are immutable in the same way). Since strings can't be changed, we construct *new* strings as we go to represent computed values.
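A quick illustration of that (a minimal sketch; the names s and t are just for demonstration):

s = 'abc'
t = s.upper()   # .upper() can't modify s; it builds and returns a new string
print(s)        # 'abc' -- the original is untouched
print(t)        # 'ABC'
print(s is t)   # False -- two distinct string objects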
The code that converts strings to ints in CPython 3.6 requests a UTF-8 form of the string to work with:
buffer = PyUnicode_AsUTF8AndSize(asciidig, &buflen);
and the string creates the UTF-8 representation the first time it's requested and caches it on the string object:
if (PyUnicode_UTF8(unicode) == NULL) {
    assert(!PyUnicode_IS_COMPACT_ASCII(unicode));
    bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL)
        return NULL;
    _PyUnicode_UTF8(unicode) = PyObject_MALLOC(PyBytes_GET_SIZE(bytes) + 1);
    if (_PyUnicode_UTF8(unicode) == NULL) {
        PyErr_NoMemory();
        Py_DECREF(bytes);
        return NULL;
    }
    _PyUnicode_UTF8_LENGTH(unicode) = PyBytes_GET_SIZE(bytes);
    memcpy(_PyUnicode_UTF8(unicode),
           PyBytes_AS_STRING(bytes),
           _PyUnicode_UTF8_LENGTH(unicode) + 1);
    Py_DECREF(bytes);
}
The extra 3 bytes are for the UTF-8 representation.
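That count lines up with a quick check from Python (a sketch; the extra byte beyond the encoded length comes from the PyObject_MALLOC(... + 1) call above, which leaves room for a trailing NUL):

utf8 = 'ñ'.encode('utf-8')
print(len(utf8))      # 2 -- 'ñ' takes two bytes in UTF-8
print(len(utf8) + 1)  # 3 -- plus the NUL terminator the cache allocates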
You might be wondering why the size doesn't change when the string is something like '40' or 'plain ascii text'. That's because if the string is in "compact ASCII" representation, Python doesn't create a separate UTF-8 representation. It returns the ASCII representation directly, which is already valid UTF-8:
#define PyUnicode_UTF8(op)                      \
    (assert(_PyUnicode_CHECK(op)),              \
     assert(PyUnicode_IS_READY(op)),            \
     PyUnicode_IS_COMPACT_ASCII(op) ?           \
         ((char*)((PyASCIIObject*)(op) + 1)) :  \
         _PyUnicode_UTF8(op))
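You can see that from Python (a sketch; the exact sizes depend on the build and Python version, but the point is that they match before and after):

import sys

s = '40'
print(sys.getsizeof(s))   # e.g. 51 on a 64-bit CPython 3.6 build
int(s)                    # succeeds, and no separate UTF-8 copy is cached
print(sys.getsizeof(s))   # same number as before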
You also might wonder why the size doesn't change for something like '１'. That's U+FF11 FULLWIDTH DIGIT ONE, which int treats as equivalent to '1'. That's because one of the earlier steps in the string-to-int process is
asciidig = _PyUnicode_TransformDecimalAndSpaceToASCII(u);
which converts all whitespace characters to ' '
and converts all Unicode decimal digits to the corresponding ASCII digits. This conversion returns the original string if it doesn't end up changing anything, but when it does make changes, it creates a new string, and the new string is the one that gets a UTF-8 representation created.
As for the cases where calling int on one string looks like it affects another, those are actually the same string object. There are many conditions under which Python will reuse strings, all just as firmly in Weird Implementation Detail Land as everything we've discussed so far. For 'ñ', the reuse happens because this is a single-character string in the Latin-1 range ('\x00'-'\xff'), and the implementation stores and reuses those.
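Here's one way to see that reuse (a sketch; this is a CPython implementation detail, not something the language guarantees):

import sys

x = 'ñ'
y = chr(0xf1)             # another spelling of the same one-character string
print(x is y)             # True on CPython -- single Latin-1 characters are cached and reused

try:
    int(x)                # fails, but first caches a UTF-8 copy on the shared object
except ValueError:
    pass

print(sys.getsizeof(y))   # y reports the larger size too -- it is the same object as x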