
Why does Python pickle load and dump inflate the size of an object on disk?

I have a pickled object in a file named b1.pkl:

$ ls -l b*
-rw-r--r--  1 fireball  staff  64743950 Oct 11 15:32 b1.pkl

Then I run the following Python code to load the object and dump it to a new file:

import numpy as np
import cPickle as pkl

fin = open('b1.pkl', 'r')
fout = open('b2.pkl', 'w')

x = pkl.load(fin)
pkl.dump(x, fout)

fin.close()
fout.close()

The file this code creates is nearly three times as large:

$ ls -l b*
-rw-r--r--  1 fireball  staff   64743950 Oct 11 15:32 b1.pkl
-rw-r--r--  1 fireball  staff  191763914 Oct 11 15:47 b2.pkl

Can anyone explain why the new file is so much larger than the original one? It should contain exactly the same structure.

user1389890 asked Oct 11 '12 22:10


2 Answers

It could be that the original pickle was written with a different protocol. Try passing protocol=2 as a keyword argument to the second pickle.dump call and test again; a binary pickle should be much smaller.
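
As a minimal sketch of that suggestion, the snippet below pickles the same object with protocol 0 and with protocol 2 and compares the resulting file sizes. It is written for Python 3, where cPickle no longer exists and the standard pickle module takes its place (an assumption, since the question uses Python 2). The sample data, the dump_size helper, and the file names p0.pkl and p2.pkl are hypothetical stand-ins for the asker's object.

```python
import os
import pickle
import random
import tempfile

# Hypothetical stand-in for the asker's data: a dict of float lists.
random.seed(0)
data = {i: [random.random() for _ in range(25)] for i in range(2000)}

def dump_size(obj, path, protocol):
    """Pickle obj to path with the given protocol and return the file size in bytes."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=protocol)
    return os.path.getsize(path)

with tempfile.TemporaryDirectory() as tmp:
    size_p0 = dump_size(data, os.path.join(tmp, "p0.pkl"), 0)  # ASCII protocol
    size_p2 = dump_size(data, os.path.join(tmp, "p2.pkl"), 2)  # binary protocol

    # Reload the protocol 2 file and confirm the data survived the round trip.
    with open(os.path.join(tmp, "p2.pkl"), "rb") as f:
        restored = pickle.load(f)

print("protocol 0:", size_p0, "bytes")
print("protocol 2:", size_p2, "bytes")
print("round trip ok:", restored == data)
```

The exact ratio depends on the data. Numeric data tends to shrink the most, because protocol 0 writes each float as decimal text while the binary protocols store it in 8 bytes.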

root answered Sep 27 '22 21:09


Most likely your original b1.pkl was written with one of the more efficient binary protocols (1 or 2), so the file starts out smaller.

When you load it with cPickle, the protocol is detected automatically from the file. But when you dump it again with the default arguments, cPickle uses protocol 0, an ASCII text format that produces much larger output. It defaults to this for portability/compatibility, so you have to request a binary protocol explicitly.

import numpy as np
import cPickle

# random data
s = {}
for i in xrange(5000):
    s[i] = np.random.randn(5,5)

# pickle it out the first time with binary protocol
with open('data.pkl', 'wb') as f:
    cPickle.dump(s, f, 2)

# read it back in and pickle it out with default args
with open('data.pkl', 'rb') as f:
    with open('data2.pkl', 'wb') as o:
        s = cPickle.load(f)
        cPickle.dump(s, o)

$ ls -l
1174109 Oct 11 16:05 data.pkl
3243157 Oct 11 16:08 data2.pkl
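
The automatic detection mentioned above can be observed directly. Pickles written with protocol 2 or higher begin with a PROTO opcode (byte 0x80) followed by the protocol number, while protocol 0 and 1 streams carry no such header. Here is a minimal sketch in Python 3 (an assumption, since the code above is Python 2), where guess_protocol is a hypothetical helper name and sample is made-up data:

```python
import pickle

def guess_protocol(raw):
    """Return the protocol number declared in a pickle's header.

    Protocols 2 and up start with the PROTO opcode (0x80) followed by
    the protocol number. Protocols 0 and 1 carry no header, so None is
    returned for them.
    """
    if len(raw) >= 2 and raw[0] == 0x80:
        return raw[1]
    return None

sample = {"a": [1.5, 2.5], "b": "text"}  # hypothetical sample object

for proto in (0, 1, 2):
    raw = pickle.dumps(sample, protocol=proto)
    print("requested", proto, "-> header says", guess_protocol(raw), "-", len(raw), "bytes")
```

For a file on disk, reading the first two bytes in 'rb' mode and passing them to the helper should be enough to tell whether it was written with a binary protocol of 2 or higher.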

jdi answered Sep 27 '22 23:09