
Pickle file size when pickling numpy arrays or lists

I have thousands of tuples, each containing long (8640-element) lists of integers. For example:

type(l1)
tuple

len(l1)
2

l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]

l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0] 

I am "pickling" the tuples and it seems that when the tuples are of lists the pickle file is lighter than when are of numpy arrays. I am not that new to python, but by no means I am an expert and I don't really know how the memory is administrated for different types of objects. I would have expected numpy arrays to be lighter, but this is what I obtain when I pickle different types of objects:

#elements in the tuple as a numpy array
l2 = [np.asarray(l1[i]) for i in range(len(l1))]
l2
[array([ 0, 31, 23, ...,  2,  0,  0]), array([ 0,  0, 11, ...,  1,  0,  0])]

#integers in the arrays are small enough to fit in two bytes
l3 = [np.asarray(l1[i], dtype='u2') for i in range(len(l1))]
l3
[array([ 0, 31, 23, ...,  2,  0,  0], dtype=uint16),
 array([ 0,  0, 11, ...,  1,  0,  0], dtype=uint16)]

#the original tuple of lists (pickle files must be opened in binary mode)
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)

#tuple of numpy arrays
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)

#tuple of numpy arrays with unsigned 2-byte integers
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)

and when I check the size of the files:

$ du -h file1.pkl
72K     file1.pkl

$ du -h file2.pkl
540K    file2.pkl

$ du -h file3.pkl
136K    file3.pkl

So even when the integers are stored in two bytes, file1 is smaller than file3. I would prefer to use arrays because unpickling (and processing) arrays is much faster than lists. However, I am going to be storing lots of these tuples (in a pandas data frame), so I would also like to optimise memory as much as possible.
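For exact byte counts (du rounds up to filesystem blocks), the sizes can also be compared directly with len(pickle.dumps(...)). A sketch using zero-filled stand-in data of the same shape and dtype as above:

```python
import pickle
import numpy as np

# stand-in data: two uint16 arrays of 8640 elements, like l3 above
l3 = [np.zeros(8640, dtype='u2'), np.zeros(8640, dtype='u2')]

# the raw payload is 2 arrays * 8640 elements * 2 bytes = 34560 bytes
size_p0 = len(pickle.dumps(l3, protocol=0))                      # text-based protocol
size_hi = len(pickle.dumps(l3, protocol=pickle.HIGHEST_PROTOCOL))  # binary protocol
print(size_p0, size_hi)
```

With a binary protocol the serialized size stays close to the raw buffer size; protocol 0 (the Python 2 default) can only be larger, since it re-encodes the binary buffer as text.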

The way I need this to work is: given a list of tuples, I do

#list of pickled objects from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]

#existing pandas data frame. Inserting new column 
df['tuples'] = tpl_pkl
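The same step can take an explicit protocol argument, which keeps numpy buffers compact. A sketch with stand-in data (the real listoftuples is assumed to hold tuples of arrays as described above):

```python
import pickle
import numpy as np

# stand-in for the real list of tuples of uint16 arrays
listoftuples = [(np.zeros(8640, dtype='u2'), np.zeros(8640, dtype='u2'))]

# an explicit binary protocol stores the raw array bytes instead of text
tpl_pkl = [pickle.dumps(t, protocol=pickle.HIGHEST_PROTOCOL)
           for t in listoftuples]
print(len(tpl_pkl[0]))
```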

Overall, my question is: is there a reason why numpy arrays take up more space than lists after being pickled to a file?

Maybe if I understand the reason I can find an optimal way of storing arrays.

Thanks in advance for your time.

Javier asked Sep 09 '15

1 Answer

If you want to store numpy arrays on disk you shouldn't be using pickle at all. Investigate numpy.save() and its kin.

If you are using pandas then it too has its own serialization methods (for example DataFrame.to_pickle and the HDF5-based HDFStore). You might want to consult this article or the answer to this question for better techniques.
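A minimal sketch of the numpy-native route (filenames are illustrative). np.save writes the raw array buffer plus a small header, so a uint16 array of 8640 elements lands near 2 bytes per element; np.savez bundles several arrays into one file:

```python
import os
import numpy as np

a = np.zeros(8640, dtype='u2')

# single array: raw buffer + a small fixed-size header
np.save('col.npy', a)
size = os.path.getsize('col.npy')
b = np.load('col.npy')
assert np.array_equal(a, b)

# several arrays in one file, addressed by keyword name
np.savez('cols.npz', first=a, second=a)
loaded = np.load('cols.npz')
assert np.array_equal(loaded['first'], a)
```

The size blow-up in the question is likely down to pickle protocol 0, Python 2's default, which writes binary buffers as escaped ASCII text (a non-printable byte becomes a four-character escape), roughly quadrupling mostly-zero data; numpy's own formats, or pickle with a binary protocol, avoid that entirely.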

holdenweb answered Sep 22 '22
holdenweb