Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why protobuf is smaller in memory than normal dict+list in python?

I have a large structure of primitive types within nested dict/list. The structure is quite complicated and doesn't really matter.

If I represent it in python's built-in types (dict/list/float/int/str) it takes 1.1 GB, but if I store it in protobuf and load it to memory it is significantly smaller. ~250 MB total.

I'm wondering how can this be. Are the built-in types in python inefficient in comparison to some external library?

Edit: The structure is loaded from json file. So no internal references between objects

like image 817
user972014 Avatar asked Aug 16 '20 16:08

user972014


People also ask

How does Python implement Protobuf?

We can encode messages into binary format using SerializeToString() and ParseFromString(). In the above code, we have serialized data of bytes and write into to file using the wb flags. We can read this file binary file and parse it using ParseFromString() method. It will return bytes representation of string.

What is Protobuf in Python?

Protocol buffers (Protobuf) are a language-agnostic data serialization format developed by Google. Protobuf is great for the following reasons: Low data volume: Protobuf makes use of a binary format, which is more compact than other formats such as JSON. Persistence: Protobuf serialization is backward-compatible.

Does Protobuf support inheritance?

Protobuf doesn't support inheritance. Having a common header and using composition is the best solution. You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.

What is the use of Google Protobuf?

Protocol Buffers (Protobuf) is a free and open-source cross-platform data format used to serialize structured data. It is useful in developing programs to communicate with each other over a network or for storing data.


1 Answers

"Simple" python objects, such as int or float, need much more memory than their C-counterparts used by protobuf.

Let's take a list of Python integers as example compared to an array of integers, as for example in an array.array (i.e. array.array('i', ...)).

The analysis for array.array is simple: discarding some overhead from the array.arrays-object, only 4 bytes (size of a C-integer) are needed per element.

The situation is completely different for a list of integers:

  • the list holds not the integer-objects themselves but pointers to the objects (8 additional bytes for a 64bit executable)
  • even a small non-zero integer needs at least 28 bytes (see import sys; sys.getsizeof(1) returns 28): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the integer-function table, 8 bytes are needed for the size of the integer value (Python's integers can be much bigger than 2^32), and at least 4 byte to hold the integer value itself.
  • there is also an overhead for memory management of 4.5 bytes.

This means there is a whopping cost of 40.5 bytes per Python integer compared to the possible 4 bytes (or 8 bytes if we use long long int, i.e. 64bit integers).

A situation is similar for a list with Python floats compared to an array of doubles( i.e. array.array('d',...)), which only needs about 8 bytes per element. But for list we have:

  • the list holds not the float objects themselves but pointers to the objects (8 additional bytes for a 64bit executable)
  • a float object needs 24 bytes (see import sys; sys.getsizeof(1.0) returns 24): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the float-function table, and 8 bytes to hold the double-value itself.
  • because 24 is a multiple of 8, the overhead for memory management is "only" about 0.5 bytes.

Which means 32.5 bytes for a Python float object vs. 8 byte for a C-double.

protobuf uses internally the same representation of the data as array.array and thus needs much less memory (about 4-5 times less, as you observe). numpy.array is another example for a data type, which holds raw C-values and thus needs much less memory than lists.


If one doesn't need to search in a dictionary, then saving the key-values-pairs in a list will need less memory than in a dictionary, because one doesn't have to maintain a structure for searching (which imposes some memory costs) - this is also another thing that leads to smaller memory footprint of protobuf-data.


To answer your other question: There are no built-in modules which are to Python-dict, what array.array are to Python-list, so I use this opportunity to shamelessly plug-in an advertisement for a library of mine: cykhash.

Sets and maps from cykhash need less than 25% of Python'S-dict/set memory but are about the same fast.

like image 115
ead Avatar answered Nov 03 '22 01:11

ead