currently I'm working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy.array
s. The first one of shape (2000, 5) and the latter of shape (2000, 128). Both containing only values of dtype=numpy.float32
.
So, I was wondering which format to use in order to save my extracted keypoints and descriptors. I.e. I'm always saving 2 files: one for the keypoints and one for the descriptors - this counts as one step in my measurements. I compared pickle
, cPickle
(both with protocol 0 and 2) and NumPy's binary format .pny
and the results are really confusing me:
I always thought cPickle
is supposed to be faster than the pickle
module. But especially the load time with protocol 0 really sticks out in the results.
Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...
PS: In my code I'm basically looping 1000 times (number=1000
) over each technique and average the measured time in the end:
timer = time.time
print 'npy save...'
t0 = timer()
for i in range(number):
numpy.save(npy_kp_path, kp)
numpy.save(npy_descr_path, descr)
t1 = timer()
results['npy']['save'] = t1 - t0
print 'npy load...'
t0 = timer()
for i in range(number):
kp = numpy.load(npy_kp_path)
descr = numpy.load(npy_descr_path)
t1 = timer()
results['npy']['load'] = t1 - t0
print 'pickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=0)
with open(pkl0_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=0)
t1 = timer()
results['pkl0']['save'] = t1 - t0
print 'pickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl0_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl0']['load'] = t1 - t0
print 'cPickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=0)
with open(cpkl0_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=0)
t1 = timer()
results['cpkl0']['save'] = t1 - t0
print 'cPickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl0_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl0']['load'] = t1 - t0
print 'pickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl2_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
t1 = timer()
results['pkl2']['save'] = t1 - t0
print 'pickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl2_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl2']['load'] = t1 - t0
print 'cPickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
with open(cpkl2_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
t1 = timer()
results['cpkl2']['save'] = t1 - t0
print 'cPickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl2_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl2']['load'] = t1 - t0
Difference between Pickle and cPickle: Pickle uses python class-based implementation while cPickle is written as C functions. As a result, cPickle is many times faster than pickle.
Pickle is slow Pickle is both slower and produces larger serialized values than most of the alternatives. Pickle is the clear underperformer here. Even the 'cPickle' extension that's written in C has a serialization rate that's about a quarter that of JSON or Thrift.
The pickle module implements serialization protocol, which provides an ability to save and later load Python objects using special binary format. Unlike json , pickle is not limited to simple objects. It can also store references to functions and classes, as well as the state of class instances.
The (binary representation of) the numeric data of an ndarray
is pickled as one long string. It appears that cPickle
is indeed much slower than pickle
in unpickling large strings from protocol 0 files. Why? My guess is that pickle
makes use of well-tuned string algorithms from the standard library and cPickle
has fallen behind.
The observation above is from playing with Python 2.7. Python 3.3, which uses a C extension automatically, is faster than either module on Python 2.7, so apparently the issue has been fixed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With