Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pickle faster than cPickle with numeric data?

currently I'm working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy.arrays. The first one of shape (2000, 5) and the latter of shape (2000, 128). Both containing only values of dtype=numpy.float32.

So, I was wondering which format to use in order to save my extracted keypoints and descriptors. I.e. I'm always saving 2 files: one for the keypoints and one for the descriptors - this counts as one step in my measurements. I compared pickle, cPickle (both with protocol 0 and 2) and NumPy's binary format .pny and the results are really confusing me:

enter image description here

I always thought cPickle is supposed to be faster than the pickle module. But especially the load time with protocol 0 really sticks out in the results. Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...

PS: In my code I'm basically looping 1000 times (number=1000) over each technique and average the measured time in the end:

    timer = time.time

    print 'npy save...'
    t0 = timer()
    for i in range(number):
        numpy.save(npy_kp_path, kp)
        numpy.save(npy_descr_path, descr)
    t1 = timer()
    results['npy']['save'] = t1 - t0

    print 'npy load...'
    t0 = timer()
    for i in range(number):
        kp = numpy.load(npy_kp_path)
        descr = numpy.load(npy_descr_path)
    t1 = timer()
    results['npy']['load'] = t1 - t0


    print 'pickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=0)
        with open(pkl0_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['pkl0']['save'] = t1 - t0

    print 'pickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl0_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl0']['load'] = t1 - t0


    print 'cPickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=0)
        with open(cpkl0_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['cpkl0']['save'] = t1 - t0

    print 'cPickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl0_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl0']['load'] = t1 - t0


    print 'pickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(pkl2_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['pkl2']['save'] = t1 - t0

    print 'pickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl2_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl2']['load'] = t1 - t0


    print 'cPickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
        with open(cpkl2_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['cpkl2']['save'] = t1 - t0

    print 'cPickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl2_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl2']['load'] = t1 - t0 
like image 613
pklip Avatar asked May 30 '13 09:05

pklip


People also ask

Is cPickle faster than pickle?

Difference between Pickle and cPickle: Pickle uses python class-based implementation while cPickle is written as C functions. As a result, cPickle is many times faster than pickle.

Are pickle files fast?

Pickle is slow Pickle is both slower and produces larger serialized values than most of the alternatives. Pickle is the clear underperformer here. Even the 'cPickle' extension that's written in C has a serialization rate that's about a quarter that of JSON or Thrift.

What is pickle protocol?

The pickle module implements serialization protocol, which provides an ability to save and later load Python objects using special binary format. Unlike json , pickle is not limited to simple objects. It can also store references to functions and classes, as well as the state of class instances.


1 Answers

The (binary representation of) the numeric data of an ndarray is pickled as one long string. It appears that cPickle is indeed much slower than pickle in unpickling large strings from protocol 0 files. Why? My guess is that pickle makes use of well-tuned string algorithms from the standard library and cPickle has fallen behind.

The observation above is from playing with Python 2.7. Python 3.3, which uses a C extension automatically, is faster than either module on Python 2.7, so apparently the issue has been fixed.

like image 125
Janne Karila Avatar answered Sep 27 '22 20:09

Janne Karila