I have a large structure of primitive types inside nested dicts/lists. The structure is quite complicated and doesn't really matter here.
If I represent it in Python's built-in types (dict/list/float/int/str) it takes 1.1 GB, but if I store it in protobuf and load it into memory it is significantly smaller: ~250 MB total.
I'm wondering how this can be. Are the built-in types in Python inefficient in comparison to some external library?
Edit: The structure is loaded from a JSON file, so there are no internal references between objects.
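For reference, one way to reproduce this kind of measurement is to ask for the deep (recursive) size of the loaded structure. A minimal sketch using the third-party pympler package (the file name is just a placeholder):

    import json
    from pympler import asizeof  # third-party: pip install pympler

    # Load the nested dict/list structure from JSON (placeholder file name).
    with open("data.json") as f:
        data = json.load(f)

    # asizeof recursively sums the sizes of all contained Python objects,
    # which is what gets compared against the protobuf in-memory size.
    print(f"deep size: {asizeof.asizeof(data) / 2**20:.1f} MiB")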
"Simple" python objects, such as int
or float
, need much more memory than their C-counterparts used by protobuf
.
Let's take a list
of Python integers as example compared to an array of integers, as for example in an array.array
(i.e. array.array('i', ...)
).
The analysis for array.array
is simple: discarding some overhead from the array.arrays
-object, only 4 bytes (size of a C-integer) are needed per element.
The situation is completely different for a list of integers:

- the list does not store the integer values themselves, but pointers to integer objects, i.e. 8 additional bytes per element for a 64bit executable;
- each integer object needs 28 bytes (see import sys; sys.getsizeof(1), which returns 28): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the integer-function table, 8 bytes are needed for the size of the integer value (Python's integers can be much bigger than 2^32), and at least 4 bytes to hold the integer value itself.

Together with the allocator's alignment and the list's over-allocation overhead, this means a whopping cost of about 40.5 bytes per Python integer, compared to the possible 4 bytes (or 8 bytes if we use long long int, i.e. 64bit integers).
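These numbers are easy to check interactively. A small sketch (the exact totals depend on the Python version and platform, so treat the values as approximate):

    import sys
    from array import array

    n = 1_000_000
    ints = list(range(1000, 1000 + n))   # values outside the small-int cache
    packed = array('i', ints)            # raw C ints, 4 bytes each

    print(sys.getsizeof(1))              # 28 on a typical 64-bit CPython
    # list container (one 8-byte pointer per slot) plus the integer objects themselves
    print(sys.getsizeof(ints) + sum(sys.getsizeof(x) for x in ints))
    # array.array: ~4 bytes per element plus a small constant header
    print(sys.getsizeof(packed))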
The situation is similar for a list of Python floats compared to an array of doubles (i.e. array.array('d', ...)), which only needs about 8 bytes per element. But for a list we have:

- 8 additional bytes per element for the pointer (for a 64bit executable);
- 24 bytes for the float object itself (see import sys; sys.getsizeof(1.0), which returns 24): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the float-function table, and 8 bytes to hold the double value itself.

That means about 32.5 bytes per Python float in a list vs. 8 bytes for a C double.
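The same kind of check for floats, this time printing the approximate per-element cost (again a sketch; exact numbers vary slightly between CPython versions):

    import sys
    from array import array

    n = 1_000_000
    floats = [float(i) for i in range(n)]
    packed = array('d', floats)          # raw C doubles, 8 bytes each

    print(sys.getsizeof(1.0))            # 24 on a typical 64-bit CPython
    # per-element cost of the list: pointer + float object
    print((sys.getsizeof(floats) + sum(sys.getsizeof(x) for x in floats)) / n)
    # per-element cost of the packed array
    print(sys.getsizeof(packed) / n)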
protobuf internally uses the same kind of compact, C-like representation of the data as array.array and thus needs much less memory (about 4-5 times less, as you observe). numpy.array is another example of a data type that holds raw C values and thus needs much less memory than lists.
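numpy makes this easy to see, because the array's buffer holds the raw C values directly (a sketch, assuming numpy is installed):

    import sys
    import numpy as np

    n = 1_000_000
    a = np.arange(n, dtype=np.int32)

    print(a.nbytes)                      # 4 * n bytes: the raw 32-bit values
    lst = a.tolist()
    # the equivalent list needs a pointer per element plus one int object per element
    print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))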
If one doesn't need to look things up in a dictionary, then storing the key-value pairs in a list needs less memory than a dictionary, because no search structure (which costs memory) has to be maintained. This is another reason for the smaller memory footprint of protobuf data.
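You can see the cost of the lookup structure by comparing a dict with two plain lists holding the same keys and values (a sketch that only measures the containers, since the key/value objects themselves are identical in both variants):

    import sys

    n = 100_000
    keys = [str(i) for i in range(n)]
    vals = [0] * n

    as_dict = dict(zip(keys, vals))      # hash table with a sparse index for O(1) lookup

    print(sys.getsizeof(as_dict))                       # the dict's table alone
    print(sys.getsizeof(keys) + sys.getsizeof(vals))    # two flat pointer arrays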
To answer your other question: there is no built-in module that is to Python's dict what array.array is to Python's list, so I'll use this opportunity to shamelessly plug an advertisement for a library of mine: cykhash.
Sets and maps from cykhash need less than 25% of the memory of Python's set/dict, but are about as fast.
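A minimal sketch of what using cykhash looks like, assuming its Int64Set type with the set-like add/contains interface described in the project's README (check the project for the exact constructors and the map types):

    from cykhash import Int64Set   # third-party: pip install cykhash

    n = 1_000_000
    py_set = set(range(n))         # stores pointers to boxed int objects plus a sparse table

    cy_set = Int64Set()            # khash-based set storing raw 64-bit integers
    for i in range(n):
        cy_set.add(i)

    print(n - 1 in py_set, n - 1 in cy_set)   # membership works the same way in both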