I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle's lack of safety isn't a concern).
Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.

For serialization, I've considered pickle (cPickle), json, and plain text - but only pickle saves the type information: json can't serialize datetime.datetime, and plain text has its obvious disadvantages.
However, cPickle is pretty slow for data this large, and I'm looking for a faster alternative.
Joblib is a replacement for pickle that is more efficient on objects carrying large numpy arrays. Its dump and load functions also accept file-like objects instead of filenames.
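A minimal sketch of the joblib approach (assuming joblib is installed; `rows` and the filename are placeholders):

```python
import joblib

rows = [(1, "abc", None)] * 10  # stand-in for your ~10**6 tuples

# dump/load accept a filename or an open file-like object
joblib.dump(rows, "rows.joblib")
restored = joblib.load("rows.joblib")
assert restored == rows
```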
JSON is a lightweight format and is often faster than pickling. There is also a security angle: unpickling data from untrusted sources should be avoided, since a malicious pickle can execute arbitrary code. JSON has no such loophole.
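If you do go the JSON route, note that (as the question says) datetime.datetime isn't serializable out of the box. A sketch of one common workaround, a custom `default` hook (the ISO-string encoding here is one choice among several, not the only option):

```python
import json
from datetime import datetime

rows = [(1, "abc", None, datetime(2020, 1, 2, 3, 4, 5))]

def default(obj):
    # Encode datetimes as ISO-8601 strings; anything else is an error.
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError("Not JSON serializable: %r" % (obj,))

blob = json.dumps(rows, default=default)
restored = json.loads(blob)
# Caveat: type information is lost - tuples come back as lists,
# and datetimes come back as plain strings.
```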
The pickle data format is standardized, so data serialized with pickle can be deserialized with cPickle and vice versa. The main difference between cPickle and pickle is performance: cPickle is many times faster because it is written in C and because its Pickler and Unpickler are functions instead of classes.
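On Python 2 the usual idiom is to prefer the C implementation and fall back to the pure-Python one (Python 3 does this automatically under the single name pickle):

```python
try:
    import cPickle as pickle  # C implementation (Python 2 only)
except ImportError:
    import pickle  # pure-Python fallback / Python 3
```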
The advantage of using pickle is that it can serialize pretty much any Python object without any extra code. It's also smart in that it will only write out any single object once, which makes it effective for storing recursive structures like graphs.
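For example, pickle round-trips a self-referential list that would make json.dumps raise a circular-reference error:

```python
import pickle

a = [1, 2]
a.append(a)  # the list contains itself

b = pickle.loads(pickle.dumps(a))
assert b[2] is b  # the cycle is preserved, not duplicated
```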
Pickle is actually quite fast so long as you aren't using the (default) ASCII protocol. Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
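A minimal sketch (`rows` and the filename are placeholders again):

```python
import pickle

rows = [(1, "abc", None)] * 10  # stand-in for your ~10**6 tuples

# Protocol 0 (the old default) is ASCII-based and slow; the highest
# available binary protocol is much faster and more compact.
with open("rows.pkl", "wb") as f:
    pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("rows.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == rows
```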