Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python memory serialisation

I was wondering whether someone might know the answer to the following.

I'm using Python to build a character-based suffix tree. There are over 11 million nodes in the tree which fits in to approximately 3GB of memory. This was down from 7GB by using the slot class method rather than the Dict method.

When I serialise the tree (using the highest protocol) the resulting file is more than a hundred times smaller.

When I load the pickled file back in, it again consumes 3GB of memory. Where does this extra overhead come from, is it something to do with Pythons handling of memory references to class instances?

Update

Thank you larsmans and Gurgeh for your very helpful explanations and advice. I'm using the tree as part of an information retrieval interface over a corpus of texts.

I originally stored the children (max of 30) as a Numpy array, then tried the hardware version (ctypes.py_object*30), the Python array (ArrayType), as well as the dictionary and Set types.

Lists seemed to do better (using guppy to profile the memory, and __slots__['variable',...]), but I'm still trying to squash it down a bit more if I can. The only problem I had with arrays is having to specify their size in advance, which causes a bit of redundancy in terms of nodes with only one child, and I have quite a lot of them. ;-)

After the tree is constructed I intend to convert it to a probabilistic tree with a second pass, but may be I can do this as the tree is constructed. As construction time is not too important in my case, the array.array() sounds like something that would be useful to try, thanks for the tip, really appreciated.

I'll let you know how it goes.

like image 502
Martyn Avatar asked May 18 '11 07:05

Martyn


People also ask

What is memory serialization?

serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment).

What is Python serialization?

Serialization refers to the process of converting a data object (e.g., Python objects, Tensorflow models) into a format that allows us to store or transmit the data and then recreate the object when needed using the reverse process of deserialization.

What is serializing and Deserializing Python?

Object serialization is the process of converting state of an object into byte stream. This byte stream can further be stored in any file-like object such as a disk file or memory stream. It can also be transmitted via sockets etc. Deserialization is the process of reconstructing the object from the byte stream.

Why pickle is used in Python?

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.


2 Answers

If you try to pickle an empty list, you get:

>>> s = StringIO()
>>> pickle.dump([], s)
>>> s.getvalue()
'(l.'

and similarly '(d.' for an empty dict. That's three bytes. The in-memory representation of a list, however, contains

  • a reference count
  • a type ID, in turn containing a pointer to the type name and bookkeeping info for memory allocation
  • a pointer to a vector of pointers to actual elements
  • and yet more bookkeeping info.

On my machine, which has 64-bit pointers, the sizeof a Python list header object is 40 bytes, so that's one order of magnitude. I assume an empty dict will have similar size.

Then, both list and dict use an overallocation strategy to obtain amortized O(1) performance for their main operations, malloc introduces overhead, there's alignment, member attributes that you may or may not even be aware of and various other factors that get you the second order of magnitude.

Summing up: pickle is a pretty good compression algorithm for Python objects :)

like image 166
Fred Foo Avatar answered Oct 17 '22 18:10

Fred Foo


Do you construct your tree once and then use it without modifying it further? In that case you might want to consider using separate structures for the dynamic construction and the static usage.

Dicts and objects are very good for dynamic modification, but they are not very space efficient in a read-only scenario. I don't know exactly what you are using your suffix tree for, but you could let each node be represented by a 2-tuple of a sorted array.array('c') and an equally long tuple of subnodes (a tuple instead of a vector to avoid overallocation). You traverse the tree using the bisect-module for lookup in the array. The index of a character in the array will correspond to a subnode in the subnode-tuple. This way you avoid dicts, objects and vector.

You could do something similar during the construction process, perhaps using a subnode-vector instead of subnode-tuple. But this will of course make construction slower, since inserting new nodes in a sorted vector is O(N).

like image 37
Gurgeh Avatar answered Oct 17 '22 18:10

Gurgeh