
Inconsistency with `sys.getsizeof`

Why is sys.getsizeof() larger for a Python str of length 1 than for a string of length 2? (For length > 2, the relationship seems to increase monotonically as expected.)

Example:

>>> from string import ascii_lowercase
>>> import sys

>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> strings
['a',
 'ab',
 'abc',
 'abcd',
 'abcde',
 'abcdef',
 'abcdefg',
 # ...

>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58,   # <--- ??
 2: 51,
 3: 52,
 4: 53,
 5: 54,
 6: 55,
 7: 56,
 8: 57,
 9: 58,
 10: 59,
 11: 60,
 12: 61,
 13: 62,
 14: 63,
 15: 64,
 16: 65,
 # ...

It seems it has to do with str.__sizeof__, but I don't know C well enough to dig into what's going on in this case.
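(For what it's worth, sys.getsizeof on a str appears to just report str.__sizeof__ directly, since strings aren't tracked by the cyclic GC and so carry no extra GC-header overhead, which is easy to check:)

```python
import sys

s = 'abc'
# For str, sys.getsizeof returns exactly s.__sizeof__(): strings are not
# tracked by the cyclic garbage collector, so no GC-header overhead is added.
assert sys.getsizeof(s) == s.__sizeof__()
```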


Edit:

This appears to be related to a single Pandas import in an IPython startup file.

I can reproduce the behavior in a plain Python session also:

 ~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:07:29) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from string import ascii_lowercase
>>> import sys
>>> strings = [ascii_lowercase[:i] for i, _ in enumerate(ascii_lowercase, 1)]
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 50, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> import pandas as pd
>>> sizes = dict(enumerate(map(sys.getsizeof, strings), 1))
>>> sizes
{1: 58, 2: 51, 3: 52, 4: 53, 5: 54, 6: 55, 7: 56, 8: 57, 9: 58, 10: 59, 11: 60, 12: 61, 13: 62, 14: 63, 15: 64, 16: 65, 17: 66, 18: 67, 19: 68, 20: 69, 21: 70, 22: 71, 23: 72, 24: 73, 25: 74, 26: 75}
>>> pd.__version__
'0.23.2'
asked Jul 16 '18 by Brad Solomon


1 Answer

When you import pandas, it does a whole ton of NumPy stuff, including calling UNICODE_setitem on all the single-ASCII-letter strings, and presumably somewhere else doing something similar on the single-ASCII-digit strings.

That NumPy function calls the deprecated C API PyUnicode_AsUnicode.

When you call that in CPython 3.3+, it caches the wchar_t * representation on the string's internal struct, in its wstr member. For 'a', that's the two wchar_t values w'a' and w'\0', which take 8 bytes on a build with 4-byte wchar_t. And str.__sizeof__ takes that into account.

So, all of the single-character interned strings for ASCII letters and digits—but nothing else—end up 8 bytes larger.
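To put rough numbers on that (a sketch; the exact header size depends on the CPython version and build, but the per-character arithmetic holds):

```python
import sys

# A compact-ASCII str is a fixed header plus one byte per character plus a
# NUL terminator, so each extra ASCII character costs exactly one byte:
header = sys.getsizeof('')            # header + terminating NUL
assert sys.getsizeof('a') == header + 1
assert sys.getsizeof('ab') == header + 2

# The cached wchar_t copy of 'a' (the character plus a NUL) adds
# 2 * sizeof(wchar_t) bytes on top of that -- 8 bytes on a 4-byte-wchar_t
# build, which is exactly the 50 -> 58 jump seen above.
```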


First, we know that it's apparently something that happens on import pandas (per Brad Solomon's answer). It may happen on np.set_printoptions(precision=4, threshold=625, edgeitems=10) (miradulo posted, but then deleted, a comment to that effect on ShadowRanger's answer), but definitely not on import numpy.

Second, we know that it happens to 'a', but what about other single-character strings?

To verify the former, and to test the latter, I ran this code:

import sys

strings = [chr(i) for i in (0, 10, 17, 32, 34, 47, 48, 57, 58, 64, 65, 90, 91, 96, 97, 102, 103, 122, 123, 130, 0x0222, 0x12345)]

sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import numpy as np
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

np.set_printoptions(precision=4, threshold=625, edgeitems=10)
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

import pandas
sizes = {c: sys.getsizeof(c) for c in strings}
print(sizes)

On multiple CPython installations (but all 64-bit CPython 3.4 or later on Linux or macOS), I got the same results:

{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 50, '9': 50, ':': 50, '@': 50, 'A': 50, 'Z': 50, '[': 50, '`': 50, 'a': 50, 'f': 50, 'g': 50, 'z': 50, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}
{'\x00': 50, '\n': 50, '\x11': 50, ' ': 50, '"': 50, '/': 50, '0': 58, '9': 58, ':': 50, '@': 50, 'A': 58, 'Z': 58, '[': 50, '`': 50, 'a': 58, 'f': 58, 'g': 58, 'z': 58, '{': 50, '\x82': 74, 'Ȣ': 76, '𒍅': 80}

So, import numpy changes nothing, and neither does set_printoptions (presumably why miradulo deleted the comment…), but import pandas does.

And it apparently affects ASCII digits and letters, but nothing else.

Also, if you change all of the prints to print(sizes.values()), so the strings never get encoded for output, you get the same results, which implies that either it's not about caching the UTF-8, or it is but that's always happening even if we don't force it.
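As a side check on the UTF-8 theory, you can force the UTF-8 cache to be populated through the C API. (This pokes CPython internals via ctypes, so it's strictly a debugging trick, but PyUnicode_AsUTF8 is a real C API function.) Compact-ASCII strings serve UTF-8 straight out of their own character buffer, so only non-ASCII strings should grow:

```python
import ctypes
import sys

ctypes.pythonapi.PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUTF8.restype = ctypes.c_char_p

ascii_s = 'spam'
non_ascii = 'Ȣ' * 4

before_ascii = sys.getsizeof(ascii_s)
before_non = sys.getsizeof(non_ascii)

ctypes.pythonapi.PyUnicode_AsUTF8(ascii_s)
ctypes.pythonapi.PyUnicode_AsUTF8(non_ascii)

# Compact-ASCII strings alias their UTF-8 to the existing data, so no growth;
# non-ASCII strings get a separately allocated, cached UTF-8 buffer.
assert sys.getsizeof(ascii_s) == before_ascii
assert sys.getsizeof(non_ascii) > before_non
```

So UTF-8 caching can inflate getsizeof, but not for pure-ASCII strings like these, which points away from UTF-8 and toward wstr.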


The obvious possibility is that whatever Pandas is calling is using one of the legacy PyUnicode APIs to generate single-character strings for all of the ASCII digits and letters. So these strings end up not in compact-ASCII format, but in legacy-ready format, right? (For details on what that means, see the comments in the source.)

Nope. Using the code from my superhackyinternals, we can see that it's still in compact-ASCII format:

import ctypes
import sys
from internals import PyUnicodeObject

s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

import pandas
s = 'a'
print(sys.getsizeof(s))
ps = PyUnicodeObject.from_address(id(s))
print(ps, ps.kind, ps.length, ps.interned, ps.ascii, ps.compact, ps.ready)
addr = id(s) + PyUnicodeObject.utf8_length.offset
buf = (ctypes.c_char * 2).from_address(addr)
print(addr, bytes(buf))

We can see that Pandas changes the size from 50 to 58, but the fields are still:

<__main__.PyUnicodeObject object at 0x101bbae18> 1 1 1 1 1 1

… in other words, it's 1BYTE_KIND, length 1, mortal-interned, ASCII, compact, and ready.

But, if you look at ps.wstr, before Pandas it's a null pointer, while after Pandas it's a pointer to the wchar_t string w"a\0". And str.__sizeof__ takes that wstr size into account.


So, the question is, how do you end up with a compact-ASCII string that has a wstr value?

Simple: you call PyUnicode_AsUnicode on it (or one of the other deprecated functions or macros that access the 3.2-style native wchar_t * internal storage). That native internal storage doesn't actually exist in 3.3+, so, for backward compatibility, those calls are handled by creating that storage on the fly, sticking it on the wstr member, and calling the appropriate PyUnicode_AsUCS[24] function to decode into that storage. (Unless you're dealing with a compact string whose kind happens to match the wchar_t width, in which case wstr is just a pointer to the native storage after all.)

Ideally, you'd expect str.__sizeof__ to include that extra storage, and from the source, you can see that it does.

Let's verify that:

import ctypes
import sys
s = 'a'
print(sys.getsizeof(s))
ctypes.pythonapi.PyUnicode_AsUnicode.argtypes = [ctypes.py_object]
ctypes.pythonapi.PyUnicode_AsUnicode.restype = ctypes.c_wchar_p
print(ctypes.pythonapi.PyUnicode_AsUnicode(s))
print(sys.getsizeof(s))

Tada, our 50 goes to 58.


So, how do you work out where this gets called?

There are actually a ton of calls to PyUnicode_AsUnicode, the PyUnicode_AS_UNICODE macro, and other functions that call them, throughout Pandas and NumPy. So I ran Python in lldb and set a breakpoint on PyUnicode_AsUnicode, with a script that skips if the calling stack frame is the same as last time.
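The setup looked roughly like this (a sketch from memory; the symbol is only visible like this on a build with debug symbols, and I've left out the frame-deduplicating breakpoint script):

```
$ lldb -- python3 -c 'import pandas'
(lldb) breakpoint set --name PyUnicode_AsUnicode
(lldb) run
# ... at each stop:
(lldb) bt          # look at the calling stack frame
(lldb) continue
```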

The first few calls involve datetime formats. Then there's one with a single letter. And the stack frame is:

multiarray.cpython-36m-darwin.so`UNICODE_setitem + 296

… and above multiarray it's pure Python all the way up to the import pandas. So, if you want to know exactly where Pandas is calling this function, you'd need to debug in pdb, which I haven't done yet. But I think we've got enough info now.

answered Oct 20 '22 by abarnert