I have a pickled object in a file named b1.pkl:
$ ls -l b*
-rw-r--r-- 1 fireball staff 64743950 Oct 11 15:32 b1.pkl
Then I run the following Python code to load the object and dump it to a new file:
import numpy as np
import cPickle as pkl
fin = open('b1.pkl', 'r')
fout = open('b2.pkl', 'w')
x = pkl.load(fin)
pkl.dump(x, fout)
fin.close()
fout.close()
The file this code creates is almost three times as large:
$ ls -l b*
-rw-r--r-- 1 fireball staff 64743950 Oct 11 15:32 b1.pkl
-rw-r--r-- 1 fireball staff 191763914 Oct 11 15:47 b2.pkl
Can anyone explain why the new file is so much larger than the original one? It should contain exactly the same structure.
It could be that the original pickle was written with a different protocol. For example, try passing protocol=2 as a keyword argument to the second pickle.dump call and test again. A binary pickle should be much smaller.
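A minimal sketch of that test, under stated assumptions: it uses Python 3's pickle module (which absorbed cPickle) and a stand-in list of plain floats rather than the asker's object, and the name payload is only illustrative. The protocol is requested explicitly on both calls so the comparison does not depend on the interpreter's default.

```python
import pickle
import random

# Stand-in data (an assumption): 10,000 random floats, not the original object.
random.seed(0)
payload = [random.random() for _ in range(10000)]

# Protocol 0 is the original ASCII format; protocol 2 is a binary format.
text_pickle = pickle.dumps(payload, protocol=0)
binary_pickle = pickle.dumps(payload, protocol=2)

print("protocol 0: %d bytes" % len(text_pickle))
print("protocol 2: %d bytes" % len(binary_pickle))

# Both encodings round-trip to the same object; only the size on disk differs.
restored_text = pickle.loads(text_pickle)
restored_binary = pickle.loads(binary_pickle)
print(restored_text == restored_binary == payload)
```

The same keyword works when writing to a file, for example pickle.dump(x, fout, protocol=2) with fout opened in 'wb' mode.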
Most likely your original b1.pkl was pickled using one of the more efficient protocols (1 or 2), so the file starts out smaller.
When you load it with cPickle, the protocol is detected automatically from the file. But when you dump it out again with default arguments, cPickle uses protocol 0, the ASCII format, which produces much larger output. That default exists for portability and compatibility, so you have to request a binary protocol explicitly.
import numpy as np
import cPickle
# random data
s = {}
for i in xrange(5000):
    s[i] = np.random.randn(5,5)
# pickle it out the first time with binary protocol
with open('data.pkl', 'wb') as f:
    cPickle.dump(s, f, 2)
# read it back in and pickle it out with default args
with open('data.pkl', 'rb') as f:
    with open('data2.pkl', 'wb') as o:
        s = cPickle.load(f)
        cPickle.dump(s, o)
$ ls -l
1174109 Oct 11 16:05 data.pkl
3243157 Oct 11 16:08 data2.pkl
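To check which protocol an existing pickle was written with, one option is to look at its first two bytes. The sketch below is an illustration, not part of the original answer: the helper name pickle_protocol is hypothetical, and it relies on the fact that pickles written with protocol 2 or higher begin with the PROTO opcode (byte 0x80) followed by the protocol number, while protocols 0 and 1 carry no such header and cannot be told apart this way.

```python
import pickle


def pickle_protocol(data):
    """Return the protocol number declared in a pickle's header, or None.

    Hypothetical helper: protocols 2 and up start with byte 0x80 followed
    by the protocol number; protocols 0 and 1 have no header to inspect.
    """
    if len(data) > 1 and data[:1] == b"\x80":
        return ord(data[1:2])
    return None


# Illustrative sample object, not the data from the question.
sample = {"a": [1.5, 2.5], "b": "text"}

declared_binary = pickle_protocol(pickle.dumps(sample, protocol=2))
declared_text = pickle_protocol(pickle.dumps(sample, protocol=0))

print(declared_binary)
print(declared_text)
```

For a file on disk, the same check can be applied to the result of reading its first two bytes in binary mode, for example open('b1.pkl', 'rb').read(2).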