 

Loading a pickled Python object uses an enormous amount of memory

I have a pickled Python object that produces a 180 MB file. When I unpickle it, memory usage explodes to 2 or 3 GB. Do you have similar experience? Is this normal?

The object is a tree (a trie) in which each node holds a dictionary of children: each edge is a letter, and each node is a potential word. So to store a word you need as many edges as the word has letters; the first level has at most 26 nodes, the second 26^2, the third 26^3, and so on. For each node that is a word, I have an attribute pointing to the information about the word (verb, noun, definition, etc.).

My words are about 40 characters at most, and I have around half a million entries. Everything goes fine until I pickle (using a simple cPickle dump): it gives a 180 MB file. I am on Mac OS, and when I unpickle these 180 MB, the OS gives 2 or 3 GB of "memory / virtual memory" to the Python process :(
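A minimal sketch of how I dump and reload it (the file name and the protocol argument here are illustrative):

import cPickle

# dump with a binary pickle protocol; 'trie.pik' is just an example name
with open('trie.pik', 'wb') as f:
    cPickle.dump(trie, f, cPickle.HIGHEST_PROTOCOL)

# ... and reload it later
with open('trie.pik', 'rb') as f:
    trie = cPickle.load(f)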

I don't see any recursion in this tree: the edges lead to child nodes, which in turn hold their own collections of children. There are no cycles.

I am a bit stuck: loading these 180 MB takes around 20 seconds (not even counting the memory issue). I have to say my CPU is not that fast: a Core i5 at 1.3 GHz. But my hard drive is an SSD, and I only have 4 GB of memory.

To add these 500,000 words to my tree, I read about 7,000 files, each containing about 100 words. Reading them makes the memory allocated by Mac OS climb to 15 GB, mostly virtual memory :( I have been using the "with" statement to ensure each file gets closed, but it doesn't really help. Reading a file takes around 0.2 s for 40 KB, which seems quite long to me. Adding the words to the tree is much faster (0.002 s).
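The loading loop looks roughly like this (the directory name and the file layout are illustrative):

import codecs
import os

trie = Trie()
words_dir = 'words'  # illustrative: ~7,000 files of ~100 words each
for name in os.listdir(words_dir):
    # "with" guarantees each file is closed before moving on
    with codecs.open(os.path.join(words_dir, name), 'r', 'utf-8') as f:
        for line in f:
            # hypothetical layout: the word, then its information, tab-separated
            word, entree = line.rstrip('\n').split('\t', 1)
            trie.add(word, entree)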

In the end I wanted to build an object database, but I guess Python is not suited to that. Maybe I will go for MongoDB :(

class Trie(object):
    """
    Class to store known entities / words / verbs...
    """
    longest_word = -1   # class-wide: length of the longest stored word
    nb_entree    = 0    # class-wide: total number of entries stored

    def __init__(self):
        self.children = {}     # edge letter -> child Trie node
        self.isWord   = False  # True if the path to this node spells a word
        self.infos    = []     # information records attached to the word

    def add(self, orthographe, entree):
        """
        Store a string with the given type and definition in the Trie structure.
        """
        if len(orthographe) > Trie.longest_word:
            Trie.longest_word = len(orthographe)

        if len(orthographe) == 0:
            # the whole string has been consumed: mark this node as a word
            self.isWord = True
            self.infos.append(entree)
            Trie.nb_entree += 1
            return True

        car = orthographe[0]
        if car not in self.children:
            self.children[car] = Trie()

        # return the recursive result so the caller sees True on success
        return self.children[car].add(orthographe[1:], entree)
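Usage looks like this (the word and its information here are made up):

>>> t = Trie()
>>> t.add(u'chat', {'type': 'noun', 'definition': 'cat'})
True
>>> t.children['c'].children['h'].children['a'].children['t'].isWord
True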
asked Aug 14 '14 by Romain Jouin


2 Answers

Python objects, especially on a 64-bit machine, are very big. When pickled, an object gets a very compact representation that is suitable for a disk file. Here's an example of a disassembled pickle:

>>> pickle.dumps({'x':'y','z':{'x':'y'}},-1)
'\x80\x02}q\x00(U\x01xq\x01U\x01yq\x02U\x01zq\x03}q\x04h\x01h\x02su.'
>>> pickletools.dis(_)
    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: U        SHORT_BINSTRING 'x'
    9: q        BINPUT     1
   11: U        SHORT_BINSTRING 'y'
   14: q        BINPUT     2
   16: U        SHORT_BINSTRING 'z'
   19: q        BINPUT     3
   21: }        EMPTY_DICT
   22: q        BINPUT     4
   24: h        BINGET     1
   26: h        BINGET     2
   28: s        SETITEM
   29: u        SETITEMS   (MARK at 5)
   30: .    STOP

As you can see, it is very compact. Nothing is repeated when repetition can be avoided: the BINPUT/BINGET opcodes memoize the strings 'x' and 'y' and reference them again instead of storing them twice.

When in memory, however, an object consists of a fairly sizable number of pointers. Let's ask Python how big an empty dictionary is (64-bit machine):

>>> {}.__sizeof__()
248

Wow! 248 bytes for an empty dictionary! Note that the dictionary comes pre-allocated with room for up to eight elements, so you pay the same memory cost whether it holds one element or none at all.

A class instance contains one dictionary to hold the instance variables. Your tries have an additional dictionary for the children, so each instance costs you nearly 500 bytes. With an estimated 2-4 million Trie objects, that is roughly 500 bytes × 3 million ≈ 1.5 GB before even counting the strings and lists they hold, so you can easily see where your memory usage comes from.


You can mitigate this a bit by adding a __slots__ to your Trie to eliminate the instance dictionary. You'll probably save about 750MB by doing this (my guess). It will prevent you from being able to add more variables to the Trie, but this is probably not a huge problem.
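A minimal sketch of the slotted version (only the class header and the attribute declarations change; the add method stays exactly as in the question):

class Trie(object):
    """
    Same trie, but with __slots__: instances get fixed attribute
    storage instead of a per-instance __dict__.
    """
    __slots__ = ('children', 'isWord', 'infos')

    longest_word = -1
    nb_entree    = 0

    def __init__(self):
        self.children = {}
        self.isWord   = False
        self.infos    = []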

answered by nneonneo


Do you really need to load or dump all of it at once? If you don't need all of it in memory, but only selected parts at any given time, you may want to map your dictionary to a set of files on disk instead of a single file… or map the dict to a database table. So, if you are looking for something that saves large dictionaries of data to disk or to a database, and can utilize pickling and encoding (codecs and hashmaps), then you might want to look at klepto.

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, and have each entry be a file. klepto also offers caching algorithms, so if you are using a filesystem backend for the dictionary you can avoid some speed penalty by utilizing memory caching.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True) 
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo          
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # the archive caches to memory, so use 'dump' to write it to the filesystem
>>> demo.dump()
>>> del demo
>>> 
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>> 

klepto also has other flags such as compression and memmode that can be used to customize how your data is stored (e.g. compression level, memory-map mode, etc). It's equally easy (the exact same interface) to use a database (MySQL, etc) as a backend instead of your filesystem. You can also turn off memory caching, so every read/write goes directly to the archive, simply by setting cached=False.
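For instance, a compressed archive can be set up like this (the archive name and the compression level are just examples):

>>> from klepto.archives import dir_archive
>>> # entries are pickled, then compressed before being written to disk;
>>> # the compression level here is illustrative
>>> zipped = dir_archive('demo_zip', {'a': 1, 'b': 2}, serialized=True, compression=3)
>>> zipped.dump()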

klepto also provides a lot of caching algorithms (like mru, lru, lfu, etc) to help you manage your in-memory cache, and will use the chosen algorithm to do the dump and load to the archive backend for you.

You can use the flag cached=False to turn off memory caching completely and read and write directly to and from disk or a database. If your entries are large enough, you might choose to write to disk and put each entry in its own file. Here's an example that does both: it turns off memory caching and stores each entry in its own file.

>>> from klepto.archives import dir_archive
>>> # does not hold entries in memory, each entry will be stored on disk
>>> demo = dir_archive('demo', {}, serialized=True, cached=False)
>>> demo['a'] = 10
>>> demo['b'] = 20
>>> demo['c'] = min
>>> demo['d'] = [1,2,3]
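With cached=False, reads also come straight from the files on disk:

>>> demo['d']
[1, 2, 3]
>>> demo['c']
<built-in function min>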

However, while this should greatly reduce load time, it might slow overall execution down a bit… it's usually better to specify a maximum amount to hold in the memory cache and to pick a good caching algorithm. You have to play with it to find the right balance for your needs.

Get klepto here: https://github.com/uqfoundation

answered by Mike McKerns