I have thousands of tuples of long lists of integers (8640 elements per list). For example:
type(l1)
tuple
len(l1)
2
l1[0][:10]
[0, 31, 23, 0, 0, 0, 0, 0, 0, 0]
l1[1][:10]
[0, 0, 11, 16, 24, 0, 0, 0, 0, 0]
I am "pickling" the tuples and it seems that when the tuples are of lists the pickle file is lighter than when are of numpy arrays. I am not that new to python, but by no means I am an expert and I don't really know how the memory is administrated for different types of objects. I would have expected numpy arrays to be lighter, but this is what I obtain when I pickle different types of objects:
import numpy as np

# elements of the tuple as numpy arrays (default dtype)
l2 = [np.asarray(l1[i]) for i in range(len(l1))]
l2
[array([ 0, 31, 23, ..., 2, 0, 0]), array([ 0, 0, 11, ..., 1, 0, 0])]
#integers in the array are small enough to be saved in two bytes
l3 = [np.asarray(l1[i], dtype='u2') for i in range(len(l1))]
l3
[array([ 0, 31, 23, ..., 2, 0, 0], dtype=uint16),
array([ 0, 0, 11, ..., 1, 0, 0], dtype=uint16)]
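As a quick check on the raw in-memory payload of those uint16 arrays (just the buffer size, separate from what pickle writes):

# each uint16 array holds 8640 elements * 2 bytes = 17280 bytes of data
l3[0].nbytes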
import pickle

# the original tuple of lists
with open('file1.pkl', 'wb') as f:
    pickle.dump(l1, f)
# tuple of numpy arrays
with open('file2.pkl', 'wb') as f:
    pickle.dump(l2, f)
# tuple of numpy arrays with integers as unsigned 2-byte values
with open('file3.pkl', 'wb') as f:
    pickle.dump(l3, f)
and when I check the size of the files:
$du -h file1.pkl
72K file1.pkl
$du -h file2.pkl
540K file2.pkl
$du -h file3.pkl
136K file3.pkl
So even when the integers are stored in two bytes, file1 is lighter than file3. I would prefer to use arrays because unpickling arrays (and processing them) is much faster than doing the same with lists. However, I am going to be storing lots of these tuples (in a pandas DataFrame), so I would also like to optimise memory as much as possible.
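(Aside, as a sketch rather than part of the measurements above: the dumps above use the default pickle protocol 0, which on Python 2 stores numpy array data as escaped text; a binary protocol can be requested explicitly and usually shrinks array pickles considerably. The file name here is just illustrative.)

# same dump as above, but asking for a binary pickle protocol
with open('file4.pkl', 'wb') as f:
    pickle.dump(l3, f, protocol=2)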
The way I need this to work is: given a list of tuples, I do:
# list of pickled byte strings from pickle.dumps
tpl_pkl = [pickle.dumps(t) for t in listoftuples]
#existing pandas data frame. Inserting new column
df['tuples'] = tpl_pkl
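For completeness, a minimal sketch of getting the tuples back out of that column later (assuming the same df and column name):

# recover the original tuples from the pickled column
restored = [pickle.loads(b) for b in df['tuples']]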
Overall, my question is: is there a reason why numpy arrays take up more space than lists after being pickled to a file?
Maybe if I understand the reason I can find an optimal way of storing arrays.
Thanks in advance for your time.
If you want to store numpy arrays on disk you shouldn't be using pickle at all. Investigate numpy.save() and its kin.
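For example, a minimal sketch of numpy.save()/numpy.load() (the file names, and reusing l3 from the question, are just for illustration):

import numpy as np

# np.save writes the raw binary buffer plus a small header describing dtype and shape
np.save('array0.npy', l3[0])
a = np.load('array0.npy')   # round-trips with the same dtype and shape

# several arrays can go into one file with np.savez (or np.savez_compressed)
np.savez('arrays.npz', first=l3[0], second=l3[1])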
If you are using pandas then it too has its own methods. You might want to consult this article or the answer to this question for better techniques.
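As a hedged sketch of the pandas-level serializers alluded to here (file names and the HDF key are illustrative; to_hdf requires PyTables):

import pandas as pd

# pandas can serialize a whole DataFrame itself
df.to_pickle('frame.pkl')
df2 = pd.read_pickle('frame.pkl')

# or write to an HDF5 store, often a better fit for large numeric data
df.to_hdf('store.h5', key='tuples_frame')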