Why does Python pickle load and dump inflate the size of an object on disk?

Question

I have a pickled object in a file named b1.pkl:

$ ls -l b*
-rw-r--r--  1 fireball  staff  64743950 Oct 11 15:32 b1.pkl

Then I run the following Python code to load the object and dump it to a new file:

import numpy as np
import cPickle as pkl

# pickle files hold bytes, so open both in binary mode
fin = open('b1.pkl', 'rb')
fout = open('b2.pkl', 'wb')

x = pkl.load(fin)
pkl.dump(x, fout)

fin.close()
fout.close()

The file this code creates is more than twice as large:

$ ls -l b*
-rw-r--r--  1 fireball  staff   64743950 Oct 11 15:32 b1.pkl
-rw-r--r--  1 fireball  staff  191763914 Oct 11 15:47 b2.pkl

Can anyone explain why the new file is so much larger than the original one? It should contain exactly the same structure.

root · Accepted Answer

The original pickle was probably written with a different protocol. Try passing protocol=2 as a keyword argument to the second dump call, then compare the file sizes again. A binary pickle should be much smaller.
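The size difference between protocols can be checked in memory, without writing any files. The following is a minimal sketch, not the asker's actual data: the dictionary of random floats is a hypothetical stand-in, and it uses the standard pickle module (in Python 2, cPickle accepts the same protocol argument).

```python
import pickle
import random

# Hypothetical stand-in for the asker's object: 5000 keys, 25 floats each.
random.seed(0)
data = {i: [random.gauss(0.0, 1.0) for _ in range(25)] for i in range(5000)}

# Serialize the same object under each protocol and compare the byte counts.
sizes = {proto: len(pickle.dumps(data, protocol=proto)) for proto in (0, 1, 2)}
for proto, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (proto, size))
```

Protocol 0 writes every float as text, while protocols 1 and 2 store each one as a fixed 8-byte binary value, so the protocol 0 output should come out noticeably larger.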

jdi · Answer

Most likely your original b1.pkl was written with one of the more efficient binary protocols (1 or 2), so the file starts out smaller.

When you load with cPickle, it detects the protocol from the file automatically. But when you dump the object again with default arguments, it uses protocol 0, the ASCII format, which produces much larger output. That default exists for portability and compatibility, so you have to request a binary protocol explicitly.

import numpy as np
import cPickle

# random data
s = {}
for i in xrange(5000):
    s[i] = np.random.randn(5,5)

# pickle it out the first time with binary protocol
with open('data.pkl', 'wb') as f:
    cPickle.dump(s, f, 2)

# read it back in and pickle it out with default args
with open('data.pkl', 'rb') as f:
    with open('data2.pkl', 'wb') as o:
        s = cPickle.load(f)
        cPickle.dump(s, o)

$ ls -l
1174109 Oct 11 16:05 data.pkl
3243157 Oct 11 16:08 data2.pkl
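To keep the round trip above from growing the file, the second dump needs an explicit protocol. Here is a rough sketch of one way to do that, assuming it is acceptable to read the whole pickle into memory. The helper names detect_protocol and repickle and the fallback parameter are made up for illustration and are not part of pickle or cPickle. The sketch relies on the fact that pickles written with protocol 2 or later start with the PROTO opcode (byte 0x80) followed by the protocol number; protocols 0 and 1 have no such header, so the sketch falls back to a binary protocol for them.

```python
import io
import pickle


def detect_protocol(raw):
    """Return the protocol number stored in the pickle header, or None."""
    # Protocol 2 and later begin with the PROTO opcode (0x80) followed by
    # one byte holding the protocol number.
    if len(raw) >= 2 and raw[:1] == b"\x80":
        return bytearray(raw[:2])[1]
    # Protocols 0 and 1 carry no header, so they cannot be identified here.
    return None


def repickle(src, dst, fallback=2):
    """Load a pickle from file object src and write it to dst in binary form.

    fallback is a hypothetical parameter: the protocol to use when the
    source file has no protocol header.
    """
    raw = src.read()
    proto = detect_protocol(raw)
    obj = pickle.loads(raw)
    pickle.dump(obj, dst, protocol=fallback if proto is None else proto)
    return proto


# Usage with in-memory buffers standing in for data.pkl and data2.pkl.
original = pickle.dumps({"a": [1.5, 2.5]}, protocol=2)
src, dst = io.BytesIO(original), io.BytesIO()
print(repickle(src, dst))
```

With real files, both would be opened in binary mode ('rb' and 'wb'), as in the example above.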
