Why is sys.getsizeof() larger for a Python str of length 1 than for a string of length 2? (For lengths of 2 and greater, the size seems to increase monotonically, as expected.)
Example:
>>> from string import ascii_lowercase
>>> import sys
>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> strings
['a',
'ab',
'abc',
'abcd',
'abcde',
'abcdef',
'abcdefg',
# ...
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58, # <--- ??
2: 51,
3: 52,
4: 53,
5: 54,
6: 55,
7: 56,
8: 57,
9: 58,
10: 59,
11: 60,
12: 61,
13: 62,
14: 63,
15: 64,
16: 65,
# ...
It seems to have something to do with str.__sizeof__, but I don't know C well enough to dig into what's going on in this case.
Edit:
This appears to be related to a single Pandas import in an IPython startup file.
I can reproduce the behavior in a plain Python session also:
~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from string import ascii_lowercase
>>> import sys
>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 50, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> import pandas as pd
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> pd.__version__
'0.23.2'
When you import pandas, it does a whole ton of NumPy stuff, including calling UNICODE_setitem on all of the single-ASCII-letter strings, and presumably somewhere else doing something similar on the single-ASCII-digit strings. That NumPy function calls the deprecated C API PyUnicode_AsUnicode.
When you call that in CPython 3.3+, it caches the wchar_t * representation on the string's internal struct, in its wstr member: the two wchar_t values for 'a' and the terminating NUL, which take 8 bytes on a build of Python with a 32-bit wchar_t. And str.__sizeof__ takes that into account.
So, all of the single-character interned strings for ASCII letters and digits—but nothing else—end up 8 bytes larger.
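If you want to reproduce the effect without importing all of Pandas, here is a minimal sketch along the lines of the explanation above. It assumes a NumPy from the same era as the question (one whose UNICODE_setitem still calls the deprecated PyUnicode_AsUnicode); on newer NumPy, or on Python 3.12+ where the wstr slot was removed entirely, nothing should change:
import sys
import numpy as np

s = 'a'
print(sys.getsizeof(s))   # 50 on a 64-bit build, before anything touches it

# Filling a unicode ('U') dtype array from the string goes through NumPy's
# UNICODE_setitem which, in these older NumPy versions, calls the deprecated
# PyUnicode_AsUnicode on the value, populating the cached 'a' object's wstr.
arr = np.array([s], dtype='U1')

print(sys.getsizeof(s))   # 58 if the wstr cache got populated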
First, we know that it's apparently something that happens on import pandas (per Brad Solomon's answer). It may happen on np.set_printoptions(precision=4, threshold=625, edgeitems=10) (miradulo posted, but then deleted, a comment to that effect on ShadowRanger's answer), but definitely not on import numpy.
Second, we know that it happens to 'a', but what about other single-character strings? To verify the former, and to test the latter, I ran this code:
import sys
strings = [chr(i) for i in (0, 10, 17, 32, 34, 47, 48, 57, 58, 64, 65, 90, 91, 96, 97, 102, 103, 122, 123, 130, 0x0222, 0x12345)]
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)
import numpy as np
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)
np.set_printoptions(precision=4, threshold=625, edgeitems=10)
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)
import pandas
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)
On multiple CPython installations (but all 64-bit CPython 3.4 or later on Linux or macOS), I got the same results:
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 58, '9': 58, ':': 50, '@': 50, 'A': 58, 'Z': 58, '[': 50, '`': 50, 'a': 58, 'f': 58, 'g': 58, 'z': 58, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
So, import numpy changes nothing, and neither does set_printoptions (presumably why miradulo deleted the comment…), but import pandas does. And it apparently affects ASCII digits and letters, but nothing else.
Also, if you change all of the prints to print(sizes.values()), so the strings never get encoded for output, you get the same results, which implies that either it's not about caching the UTF-8, or it is but that caching always happens even if we don't force it.
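One detail worth making explicit (a fact about CPython rather than something shown above): single-character strings with code points below 256 are cached by the interpreter, so the 'a' that Pandas touched during its import is the very same object as any 'a' you build afterwards, including the one produced by slicing ascii_lowercase. That is why the inflated size shows up on strings you create yourself:
import sys
from string import ascii_lowercase

a = 'a'
# CPython keeps a cache of single-character strings with code points < 256,
# so all of these refer to the same shared object.
print(a is chr(97))                # True
print(a is ascii_lowercase[:1])    # True

# Whatever populated the wstr cache on that shared object (e.g. the pandas
# import) is therefore visible via sys.getsizeof on "your" strings too.
print(sys.getsizeof(ascii_lowercase[:1]))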
The obvious possibility is that whatever Pandas is calling is using one of the legacy PyUnicode APIs to generate single-character strings for all of the ASCII digits and letters. So these strings end up not in compact-ASCII format, but in legacy-ready format, right? (For details on what that means, see the comments in the source.)
Nope. Using the code from my superhackyinternals, we can see that the string is still in compact-ASCII format:
import ctypes
import sys

from internals import PyUnicodeObject

s = 'a'
print(sys.getsizeof(s))
# Overlay the ctypes mirror of the CPython str struct on the live object.
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
# For a compact-ASCII string, the character buffer starts where the
# utf8_length field would otherwise be (right after PyASCIIObject).
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

import pandas

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))
We can see that Pandas changes the size from 50 to 58, but the fields are still:
<__main__.PyUnicodeObject object at 0x101bbae18> 1 1 1 1 1 1
… in other words, it's 1BYTE_KIND, length 1, mortal-interned, ASCII, compact, and ready.
But, if you look at ps.wstr, before Pandas it's a null pointer, while after Pandas it's a pointer to the two-element wchar_t buffer holding 'a' and a NUL terminator. And str.__sizeof__ takes that wstr size into account.
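To see why those particular numbers come out, here is the arithmetic, assuming the usual 64-bit CPython 3.3–3.11 layout (sizeof(PyASCIIObject) is 48 bytes, and wchar_t is 4 bytes on Linux/macOS):
# Compact-ASCII str: struct header + one byte per character + trailing NUL.
base = 48 + 1 + 1                 # 'a' -> 50, matching sys.getsizeof('a')

# Once wstr is populated, __sizeof__ also counts (length + 1) wchar_t values.
with_wstr = base + (1 + 1) * 4    # 50 + 8 -> 58, matching the post-pandas size
print(base, with_wstr)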
So, the question is, how do you end up with a compact-ASCII string that has a wstr value?
Simple: you call PyUnicode_AsUnicode on it (or one of the other deprecated functions or macros that access the 3.2-style native wchar_t * internal storage). That native internal storage doesn't actually exist in 3.3+, so for backward compatibility, those calls are handled by creating that storage on the fly, sticking it on the wstr member, and calling the appropriate PyUnicode_AsUCS[24] function to decode into that storage. (Unless you're dealing with a compact string whose kind happens to match the wchar_t width, in which case wstr is just a pointer to the native storage after all.)
You'd ideally expect str.__sizeof__ to include that extra storage, and from the source, you can see that it does. Let's verify that:
import ctypes
import sys

s = 'a'
print(sys.getsizeof(s))
# Call the deprecated C API directly; this forces CPython to build and cache
# the wchar_t representation on the string's wstr member.
ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p
print(ctypes.pythonapi.PyUnicode_AsUnicode(s))
print(sys.getsizeof(s))
Tada, our 50 goes to 58.
So, how do you work out where this gets called? There are actually a ton of calls to PyUnicode_AsUnicode, the PyUnicode_AS_UNICODE macro, and other functions that call them, throughout Pandas and NumPy. So I ran Python in lldb and attached a breakpoint to PyUnicode_AsUnicode, with a script that skips if the calling stack frame is the same as last time.
The first few calls involve datetime formats. Then there's one with a single letter. And the stack frame is:
multiarray.cpython-36m-darwin.so`UNICODE_setitem + 296
… and above multiarray it's pure Python all the way up to the import pandas. So, if you want to know exactly where Pandas is calling this function, you'd need to debug in pdb, which I haven't done yet. But I think we've got enough info now.