Why is dill much faster and more disk-efficient than pickle for numpy arrays

Tags: python, serialization, numpy, pickle, dill

I'm using Python 2.7 and NumPy 1.11.2 with the latest version of dill (freshly installed with pip install dill) on Ubuntu 16.04.

When storing a NumPy array using pickle, I find that pickle is very slow and stores the array at almost three times the 'necessary' size.

For example, in the following code, pickle is approximately 50 times slower than dill (about 50s versus 1s) and creates a file that is 2.2GB instead of 800MB.

import numpy
import pickle
import dill

# 10000 x 10000 array of random floats
B = numpy.random.rand(10000, 10000)

# Default protocol in both cases
with open('dill', 'wb') as fp:
    dill.dump(B, fp)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp)
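
For reference, the 800MB figure corresponds to the raw data of the array, assuming numpy.random.rand returns 8-byte float64 values. A quick arithmetic check using only the standard library (the variable names here are just for illustration):

```python
import struct

rows, cols = 10000, 10000
bytes_per_float64 = struct.calcsize("<d")  # standard size of a double: 8 bytes
raw_bytes = rows * cols * bytes_per_float64

print(raw_bytes)        # total bytes of array data
print(raw_bytes / 1e6)  # the same in megabytes (10**6 bytes)
```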

I thought dill was just a wrapper around pickle. If this is true, is there a way that I can improve the performance of pickle myself? Is it generally not advisable to use pickle for NumPy arrays?

EDIT: Using Python 3, I get the same performance for pickle and dill.

PS: I know about numpy.save, but I am working in a framework where I store lots of different objects, all residing in a dictionary, to a file.

asked Jun 22 '17 10:06

Bananach

2 Answers

I'm the dill author. dill is an extension of pickle, but it does add some alternate pickling methods for numpy and other objects. For example, dill leverages the numpy methods for the pickling of arrays.

Additionally, I believe dill uses DEFAULT_PROTOCOL (not HIGHEST_PROTOCOL) by default on Python 3, and HIGHEST_PROTOCOL by default on Python 2.
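
One way to check these defaults on a given installation is sketched below. It assumes Python 3 (where pickle.DEFAULT_PROTOCOL exists) and assumes that dill exposes its defaults through a dill.settings dictionary with a 'protocol' key. If that assumption does not hold for your dill version, the sketch prints "unknown" instead of failing.

```python
import pickle

# Protocols used by the standard library on this interpreter.
print("pickle default:", pickle.DEFAULT_PROTOCOL)
print("pickle highest:", pickle.HIGHEST_PROTOCOL)

try:
    import dill
except ImportError:
    dill = None  # dill is optional for this check

if dill is not None:
    # Assumption: dill keeps its defaults in the `dill.settings` dict.
    dill_settings = getattr(dill, "settings", None)
    if isinstance(dill_settings, dict):
        print("dill default:", dill_settings.get("protocol", "unknown"))
    else:
        print("dill default: unknown")
```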

answered Sep 20 '22 14:09

Mike McKerns

This ought to be a comment, but I do not have enough reputation... My guess is that this is due to the pickle protocol used.

On Python 2, the default protocol is 0 and the highest supported protocol is 2. On Python 3, the default protocol is 3 and the highest supported protocol is 4 (as of Python 3.6).

Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, except if you need to be able to read your pickles using extremely old versions of Python. Protocol 2 is already much better.
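
To get a feel for how much the protocol matters, here is a minimal sketch that measures the pickled size of one object under every protocol the running interpreter supports. The helper name pickled_sizes is made up for this example, and a plain list of Python floats stands in for the array so that only the standard library is needed.

```python
import pickle
import random


def pickled_sizes(obj):
    """Return {protocol: size in bytes} for every supported protocol."""
    return {
        protocol: len(pickle.dumps(obj, protocol=protocol))
        for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
    }


random.seed(0)
values = [random.random() for _ in range(100000)]  # 100,000 Python floats

sizes = pickled_sizes(values)
for protocol, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (protocol, size))
```

Protocol 0 writes each float as text, while protocol 2 and later store it in binary form, so the protocol 0 output is noticeably larger for this list. The same helper accepts a NumPy array, although the exact numbers will depend on the Python and NumPy versions in use.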

Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably be the cause of a good deal of the speed difference. You could try using pickle.HIGHEST_PROTOCOL to see if you get similar performance using dill and standard pickle.

with open('dill', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
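
Since the question mentions storing a whole dictionary of objects, the same comparison can be sketched for that case. Everything below is illustrative: the store dictionary, the dump_timed helper, and the file names are invented for the example, and it assumes Python 3 and uses only the standard library.

```python
import os
import pickle
import random
import shutil
import tempfile
import time


def dump_timed(obj, path, protocol):
    """Pickle obj to path; return (seconds taken, file size in bytes)."""
    start = time.perf_counter()
    with open(path, "wb") as fp:
        pickle.dump(obj, fp, protocol=protocol)
    return time.perf_counter() - start, os.path.getsize(path)


# Hypothetical stand-in for the framework's dictionary of objects.
random.seed(0)
store = {
    "weights": [random.random() for _ in range(200000)],
    "name": "run-1",
    "shape": (10000, 10000),
}

tmp_dir = tempfile.mkdtemp()
results = {}
for protocol in (0, pickle.HIGHEST_PROTOCOL):
    path = os.path.join(tmp_dir, "store_p%d.pkl" % protocol)
    results[protocol] = dump_timed(store, path, protocol)
    with open(path, "rb") as fp:
        assert pickle.load(fp) == store  # round trip must preserve the data
    print("protocol %d: %.3f s, %d bytes" % ((protocol,) + results[protocol]))

shutil.rmtree(tmp_dir)  # remove the temporary files
```

Timings vary from machine to machine, so treat the printed seconds as a rough indication only. The file sizes are deterministic for a given Python version.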

answered Sep 21 '22 14:09

Gaëtan de Menten
