pickle faster than cPickle with numeric data?

Currently I'm working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy arrays: the former has shape (2000, 5) and the latter shape (2000, 128). Both contain only values of dtype=numpy.float32.

So I was wondering which format to use to save my extracted keypoints and descriptors. I always save 2 files, one for the keypoints and one for the descriptors; together this counts as one step in my measurements. I compared pickle, cPickle (both with protocol 0 and 2) and NumPy's binary format .npy, and the results are really confusing me:

[Image: chart of the measured save and load times per format, not preserved]

I always thought cPickle is supposed to be faster than the pickle module. But especially the load time with protocol 0 really sticks out in the results. Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...

PS: In my code I basically loop 1000 times (number=1000) over each technique and average the measured time at the end:

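    import time
    import pickle
    import cPickle  # Python 2 only
    import numpy

    # kp, descr, number, results and the *_path variables are defined elsewhere
    timer = time.time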

    print 'npy save...'
    t0 = timer()
    for i in range(number):
        numpy.save(npy_kp_path, kp)
        numpy.save(npy_descr_path, descr)
    t1 = timer()
    results['npy']['save'] = t1 - t0

    print 'npy load...'
    t0 = timer()
    for i in range(number):
        kp = numpy.load(npy_kp_path)
        descr = numpy.load(npy_descr_path)
    t1 = timer()
    results['npy']['load'] = t1 - t0


    print 'pickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=0)
        with open(pkl0_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['pkl0']['save'] = t1 - t0

    print 'pickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl0_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl0']['load'] = t1 - t0


    print 'cPickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=0)
        with open(cpkl0_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['cpkl0']['save'] = t1 - t0

    print 'cPickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl0_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl0']['load'] = t1 - t0


    print 'pickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(pkl2_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['pkl2']['save'] = t1 - t0

    print 'pickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl2_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl2']['load'] = t1 - t0


    print 'cPickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
        with open(cpkl2_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['cpkl2']['save'] = t1 - t0

    print 'cPickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl2_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl2']['load'] = t1 - t0
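
The repeated save/load blocks above could be folded into one helper built on timeit. This is only a rough sketch for Python 3, where cPickle no longer exists as a separate module. The helper name time_roundtrip and the bytes payload (a stand-in with the same byte count as the descriptor array) are assumptions made for illustration, not part of the original measurement code.

```python
import os
import pickle
import tempfile
import timeit


def time_roundtrip(obj, protocol, number=20, repeat=3):
    """Return (save_seconds, load_seconds) per call, best of `repeat` runs."""
    fd, path = tempfile.mkstemp(suffix='.pkl')
    os.close(fd)

    def save():
        with open(path, 'wb') as f:
            pickle.dump(obj, f, protocol=protocol)

    def load():
        with open(path, 'rb') as f:
            return pickle.load(f)

    try:
        save()  # make sure the file exists before load() is timed
        assert load() == obj
        t_save = min(timeit.repeat(save, number=number, repeat=repeat)) / number
        t_load = min(timeit.repeat(load, number=number, repeat=repeat)) / number
    finally:
        os.remove(path)
    return t_save, t_load


# Stand-in payload: same byte count as a (2000, 128) float32 array.
payload = bytes(bytearray(range(256))) * (2000 * 128 * 4 // 256)

results = {}
for proto in (0, pickle.HIGHEST_PROTOCOL):
    results[proto] = time_roundtrip(payload, proto)
    print('protocol %d: save %.6fs, load %.6fs' % ((proto,) + results[proto]))
```

Taking the minimum over several repeats tends to be less sensitive to background noise than a single time.time() difference, but the absolute numbers will still depend on the machine and the disk.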

asked May 30 '13 09:05

pklip

1 Answer

The (binary representation of the) numeric data of an ndarray is pickled as one long string. It appears that cPickle is indeed much slower than pickle at unpickling large strings from protocol 0 files. Why? My guess is that pickle makes use of well-tuned string algorithms from the standard library and cPickle has fallen behind.
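
To see the "one long string" for yourself, you can inspect what the array hands to pickle. The following is a minimal sketch, assuming NumPy is installed and Python 3 is used (so the string is a bytes object); the array contents are arbitrary and only the shape and dtype match the question.

```python
import pickle

import numpy as np

# Same shape and dtype as the descriptors in the question; values are arbitrary.
descr = np.arange(2000 * 128, dtype=np.float32).reshape(2000, 128)

# __reduce__ returns (reconstructor, args, state); the last item of state
# is the raw buffer of the array as a single bytes string.
reconstructor, args, state = descr.__reduce__()
raw = state[-1]
print(type(raw).__name__, len(raw), descr.nbytes)

# A protocol 0 pickle carries that whole buffer in an escaped text form,
# so it cannot be smaller than the raw data.
pkl0 = pickle.dumps(descr, protocol=0)
print(len(pkl0) >= descr.nbytes)

# Round trip to confirm nothing is lost.
restored = pickle.loads(pkl0)
print(np.array_equal(restored, descr))
```

So whatever the unpickler does with large strings dominates the load time for arrays like these.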

The observation above is from playing with Python 2.7. Python 3.3, which uses a C extension automatically, is faster than either module on Python 2.7, so apparently the issue has been fixed.
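
On Python 3 a similar comparison can still be made, because the pure-Python implementation stays available next to the C one. The sketch below relies on pickle._loads, a private and undocumented CPython helper, so treat it as an assumption that may not hold on other interpreters or future versions. The bytes payload is again just a stand-in for array data, and no particular outcome is implied.

```python
import pickle
import timeit

# About 1 MB of data, pickled once with the text-based protocol 0.
payload = bytes(bytearray(range(256))) * 4000
data0 = pickle.dumps(payload, protocol=0)


def best(func, number=5, repeat=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# pickle.loads is the C implementation when the _pickle extension is present;
# pickle._loads is the pure-Python fallback (private name, CPython detail).
accelerated = pickle.loads is not pickle._loads
t_c = best(lambda: pickle.loads(data0))
t_py = best(lambda: pickle._loads(data0))

print('C accelerator in use:', accelerated)
print('pickle.loads : %.6fs' % t_c)
print('pickle._loads: %.6fs' % t_py)
```

Which of the two comes out ahead, and by how much, depends on the Python version and the payload.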

answered Sep 27 '22 20:09

Janne Karila


Tags: python, numpy, pickle
