
Why are Unicode strings having a different memory footprint in Python 2 and 3? [duplicate]

In Python 2, an empty string occupies exactly 37 bytes,

>>> print sys.getsizeof('')
37

In Python 3.6, the same call would output 49 bytes,

>>> print(sys.getsizeof(''))
49

Now I thought that this was due to the fact that in Python 3, all strings are now Unicode. But to my surprise, here are some confusing outputs:

Python 2.7

>>> print sys.getsizeof(u'')
52
>>> print sys.getsizeof(u'1')
56

Python 3.6

>>> print(sys.getsizeof(''))
49
>>> print(sys.getsizeof('1'))
50
  1. An empty string is not the same size in the two versions.
  2. Adding a character needs 4 additional bytes in Python 2 and only 1 in Python 3.

Why is the memory footprint different between the two versions?

EDIT

I specified the exact version of my Python environment, because there are differences between Python 3 builds.

scharette asked Mar 05 '23 16:03


1 Answer

There are reasons, of course, but it should not matter for any practical purpose. If your Python system has to keep so many strings in memory that it gets close to the system's memory limit, you should optimize it by (1) lazily loading or creating the strings, or (2) using an efficient byte-oriented binary structure for your data, such as those provided by NumPy or Python's own bytearray.
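As a rough illustration of option (2), here is a minimal sketch that compares keeping many small str objects with packing the same data into one bytearray. The `records` data is made up for the example, the newline separator is an arbitrary choice, and the exact numbers depend on the CPython version and build.

```python
import sys

# Hypothetical data: 1,000 short ASCII records.
records = ["item-%04d" % i for i in range(1000)]

# Cost of keeping each record as its own str object
# (the list that holds them is not counted here).
as_strings = sum(sys.getsizeof(r) for r in records)

# The same data packed into a single bytearray, one record per line.
packed = bytearray("\n".join(records), "ascii")
as_bytearray = sys.getsizeof(packed)

print("separate str objects:", as_strings, "bytes")
print("single bytearray:    ", as_bytearray, "bytes")
```

The per-object header is paid once for the bytearray instead of once per string, which is where most of the saving comes from for short records.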

The change in size for the empty string literal (unicode literal for Py2) could be due to any implementation detail that differs between the versions you are looking at. It should not matter even if you were writing C code to interact directly with Python strings: even that code should only touch the strings via the API.

Now, the specific reason why the string in Python 3 grows by just 1 byte, while in Python 2 it grows by 4 bytes, is PEP 393.

Before Python 3.3, every (unicode) string in Python used either a fixed 2 bytes or a fixed 4 bytes of memory per character, and the Python interpreter and any Python modules using native code had to be compiled for just one of these widths. In effect, you could have incompatible Python binaries even when the versions matched, because of the string-width option picked at build time; the two kinds of build were known as the "narrow build" and the "wide build" (the 4-byte step in the question's Python 2.7 output indicates a wide build). With the above-mentioned PEP 393, the character size of a Python string is determined when the string is instantiated, depending on the widest Unicode code point it contains. Strings whose code points all fall within the first 256 code points (equivalent to the Latin-1 character set) use only 1 byte per character.
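To see the PEP 393 behaviour for yourself, here is a minimal sketch for CPython 3.3 or later. The helper name `bytes_per_char` is just something made up for this example; it measures how much `sys.getsizeof` grows when a string gets one character longer, so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch, n=10):
    """Return how many bytes one extra copy of `ch` adds to a string.

    Both strings are built at runtime from the same character, so they
    share the same internal width and only differ by one character.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 U+00E9", "\xe9"),
    ("BMP U+20AC", "\u20ac"),
    ("astral U+1F600", "\U0001F600"),
]

for label, ch in samples:
    print(label, "->", bytes_per_char(ch), "byte(s) per character")
```

On a CPython 3.3+ interpreter this should report 1, 1, 2 and 4 bytes per character respectively. The absolute sizes (such as the 49 bytes for the empty string) are implementation details and vary between versions.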

jsbueno answered Mar 08 '23 22:03