I'm using Python 2.7 and NumPy 1.11.2, as well as the latest version of dill (I just did pip install dill), on Ubuntu 16.04.
When storing a NumPy array using pickle, I find that pickle is very slow, and stores arrays at almost three times the 'necessary' size.
For example, in the following code, pickle is approximately 50 times slower (1s versus 50s), and creates a file that is 2.2GB instead of 800MB.
import numpy
import pickle
import dill

B = numpy.random.rand(10000, 10000)

with open('dill', 'wb') as fp:
    dill.dump(B, fp)

with open('pickle', 'wb') as fp:
    pickle.dump(B, fp)
I thought dill was just a wrapper around pickle. If this is true, is there a way that I can improve the performance of pickle myself? Is it generally not advisable to use pickle for NumPy arrays?
EDIT: Using Python 3, I get the same performance for pickle and dill.
PS: I know about numpy.save, but I am working in a framework where I store lots of different objects, all residing in a dictionary, to a file.
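For reference, if everything in the dictionary were a NumPy array, something like the following would already do the job (a rough sketch, with made-up names), but my dictionary mixes arrays with other kinds of objects:
import numpy

# Hypothetical dictionary of named arrays, for illustration only
data = {
    'weights': numpy.random.rand(1000, 1000),
    'bias': numpy.random.rand(1000),
}

# numpy.savez stores each keyword argument as a named array in a single .npz file
numpy.savez('arrays.npz', **data)

# numpy.load returns a dict-like NpzFile keyed by the same names
loaded = numpy.load('arrays.npz')
weights = loaded['weights']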
I'm the dill author. dill is an extension of pickle, but it does add some alternate pickling methods for numpy and other objects. For example, dill leverages the numpy methods for the pickling of arrays.
Additionally, (I believe) dill uses DEFAULT_PROTOCOL by default (not HIGHEST_PROTOCOL) on Python 3, while on Python 2 it uses HIGHEST_PROTOCOL by default.
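You can check which protocol constants your interpreter exposes and pass a protocol explicitly, since dill mirrors the pickle interface; a quick sketch (the file name is arbitrary):
import pickle
import dill
import numpy

# Python 3 exposes both constants; Python 2.7 only has HIGHEST_PROTOCOL
print(pickle.HIGHEST_PROTOCOL)                      # 2 on Python 2.7, 4 on Python 3.6
print(getattr(pickle, 'DEFAULT_PROTOCOL', 'n/a'))   # 3 on Python 3.6

B = numpy.random.rand(1000, 1000)

# dill.dump accepts the same protocol argument as pickle.dump
with open('dill_highest', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)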
This ought to be a comment, but I don't have enough reputation... My guess is that this is due to the pickle protocol used.
On Python 2, the default protocol is 0 and highest supported protocol is 2. On Python 3, the default protocol is 3 and highest supported protocol is 4 (as of Python 3.6).
Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, except if you need to be able to read your pickles using extremely old versions of Python. Protocol 2 is already much better.
Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably be the cause of a good deal of the speed difference. You could try using pickle.HIGHEST_PROTOCOL to see if you get similar performance using dill and standard pickle.
with open('dill', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)

with open('pickle', 'wb') as fp:
    pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
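To confirm that the protocol is the culprit, a rough timing comparison along these lines should show protocol 0 being dramatically slower (array size reduced here so the slow case finishes quickly; the exact numbers will vary by machine):
import time
import pickle
import numpy

B = numpy.random.rand(2000, 2000)

# Time pickle.dump with the old default protocol 0 and with the highest protocol
for proto in (0, pickle.HIGHEST_PROTOCOL):
    start = time.time()
    with open('pickle_proto_%s' % proto, 'wb') as fp:
        pickle.dump(B, fp, protocol=proto)
    print('protocol %s: %.2f s' % (proto, time.time() - start))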