I have created a dictionary in Python and dumped it into a pickle file. Its size came to 300MB. Now I want to load the same pickle.
    output = open('myfile.pkl', 'rb')
    mydict = pickle.load(output)
Loading this pickle takes around 15 seconds. How can I reduce this time?
Hardware Specification: Ubuntu 14.04, 4GB RAM
The code below shows how long it takes to dump and load the file using json, pickle, and cPickle.
After dumping, the file size is around 300MB.
    import json, pickle, cPickle
    import os, timeit

    mydict = {all values to be added}  # placeholder for the real data

    def dump_json():
        output = open('myfile1.json', 'wb')
        json.dump(mydict, output)
        output.close()

    def dump_pickle():
        output = open('myfile2.pkl', 'wb')
        pickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
        output.close()

    def dump_cpickle():
        output = open('myfile3.pkl', 'wb')
        cPickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
        output.close()

    def load_json():
        output = open('myfile1.json', 'rb')
        mydict = json.load(output)
        output.close()

    def load_pickle():
        output = open('myfile2.pkl', 'rb')
        mydict = pickle.load(output)
        output.close()

    def load_cpickle():
        output = open('myfile3.pkl', 'rb')
        # Must be cPickle.load: calling pickle.load here would time the
        # pure-Python loader a second time instead of the C implementation.
        mydict = cPickle.load(output)
        output.close()

    if __name__ == '__main__':
        # This script is saved as pickle_wr.py, so timeit can import it.
        print "Json dump: "
        t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")
        print t.timeit(1), '\n'

        print "Pickle dump: "
        t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")
        print t.timeit(1), '\n'

        print "cPickle dump: "
        t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")
        print t.timeit(1), '\n'

        print "Json load: "
        t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")
        print t.timeit(1), '\n'

        print "pickle load: "
        t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")
        print t.timeit(1), '\n'

        print "cPickle load: "
        t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")
        print t.timeit(1), '\n'
Output:

    Json dump: 42.5809804916
    Pickle dump: 52.87407804489
    cPickle dump: 1.1903790187836
    Json load: 12.240660209656
    pickle load: 24.48748306274
    cPickle load: 24.4888298893
I have seen that cPickle takes less time to dump and load but loading a file still takes a long time.
Try using the json library instead of pickle. This should be an option in your case because you're dealing with a dictionary, which is a relatively simple object.
According to this website,
JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).
Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?
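Whether JSON actually loads faster depends on the data and the Python version, so it is worth measuring on a small sample before converting a 300MB file. The following is a minimal sketch, not the asker's code: the `sample` dictionary, the `timed_load` helper, and the file names are all hypothetical stand-ins. It also shows one caveat to check first, namely that JSON object keys are always strings.

```python
import json
import os
import pickle
import tempfile
import time

# Hypothetical stand-in for the real dictionary (string keys, simple values).
sample = {"key%d" % i: [i, i * 0.5, str(i)] for i in range(10000)}


def timed_load(path, loader, mode):
    """Load `path` once with `loader` and return (object, elapsed seconds)."""
    start = time.time()
    with open(path, mode) as handle:
        obj = loader(handle)
    return obj, time.time() - start


tmpdir = tempfile.mkdtemp()
json_path = os.path.join(tmpdir, "sample.json")
pkl_path = os.path.join(tmpdir, "sample.pkl")

# Write the same data in both formats.
with open(json_path, "w") as handle:
    json.dump(sample, handle)
with open(pkl_path, "wb") as handle:
    pickle.dump(sample, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Time one load of each file.
from_json, json_seconds = timed_load(json_path, json.load, "r")
from_pickle, pickle_seconds = timed_load(pkl_path, pickle.load, "rb")

print("json load:   %.4f s" % json_seconds)
print("pickle load: %.4f s" % pickle_seconds)

# Caveat: JSON has only string keys, so integer keys do not survive.
int_keyed = json.loads(json.dumps({1: "a"}))
print(int_keyed)
```

If the real dictionary uses non-string keys or values that JSON cannot represent (tuples, sets, custom objects), the round trip will change or reject the data, and pickle remains the safer choice.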
Upgrading Python or using the marshal module with a fixed Python version also helps boost speed (code adapted from here):
    try:
        import cPickle
    except ImportError:
        # Python 3 has no cPickle module; pickle already uses the C implementation.
        import pickle as cPickle
    import pickle
    import json, marshal, random
    from time import time
    from hashlib import md5

    test_runs = 1000

    if __name__ == "__main__":
        # Three payloads of 1000 items each: floats, ints, and hex strings.
        payload = {
            "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
            "int": [random.randrange(0, 9999) for i in range(1000)],
            "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
        }
        modules = [json, pickle, cPickle, marshal]

        for payload_type in payload:
            data = payload[payload_type]
            for module in modules:
                # Time test_runs serializations (W).
                start = time()
                if module.__name__ in ['pickle', 'cPickle']:
                    for i in range(test_runs):
                        # protocol=-1 selects the highest available protocol
                        serialized = module.dumps(data, protocol=-1)
                else:
                    for i in range(test_runs):
                        serialized = module.dumps(data)
                w = time() - start

                # Time test_runs deserializations (R).
                start = time()
                for i in range(test_runs):
                    unserialized = module.loads(serialized)
                r = time() - start
                print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
Results:
    C:\Python27\python.exe -u "serialization_benchmark.py"
    json int W 0.125 R 0.156
    pickle int W 2.808 R 1.139
    cPickle int W 0.047 R 0.046
    marshal int W 0.016 R 0.031
    json float W 1.981 R 0.624
    pickle float W 2.607 R 1.092
    cPickle float W 0.063 R 0.062
    marshal float W 0.047 R 0.031
    json str W 0.172 R 0.437
    pickle str W 5.149 R 2.309
    cPickle str W 0.281 R 0.156
    marshal str W 0.109 R 0.047

    C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
    json int W 0.515 R 0.452
    pickle int W 0.546 R 0.219
    cPickle int W 0.577 R 0.171
    marshal int W 0.032 R 0.031
    json float W 2.390 R 1.341
    pickle float W 0.656 R 0.436
    cPickle float W 0.593 R 0.406
    marshal float W 0.327 R 0.203
    json str W 1.141 R 1.186
    pickle str W 0.702 R 0.546
    cPickle str W 0.828 R 0.562
    marshal str W 0.265 R 0.078

    c:\Python34\python -u "serialization_benchmark.py"
    json int W 0.203 R 0.140
    pickle int W 0.047 R 0.062
    pickle int W 0.031 R 0.062
    marshal int W 0.031 R 0.047
    json float W 1.935 R 0.749
    pickle float W 0.047 R 0.062
    pickle float W 0.047 R 0.062
    marshal float W 0.047 R 0.047
    json str W 0.281 R 0.187
    pickle str W 0.125 R 0.140
    pickle str W 0.125 R 0.140
    marshal str W 0.094 R 0.078
Python 3.4 uses pickle protocol 3 by default, which made no difference compared to protocol 4 in this benchmark. Python 2's highest pickle protocol is 2 (selected when a negative value is passed as the protocol to dump), which is about twice as slow as protocol 3.
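To check how much the protocol matters on your own interpreter, you can time every protocol it supports and compare against marshal. This is a rough sketch under assumptions: the `data` payload, the `time_pickle` helper, and the run count are made up for illustration and are not part of the benchmark above. Note that marshal only handles built-in types and its format is not guaranteed to stay compatible between Python versions, which is why it is only suitable with a fixed Python version.

```python
import marshal
import pickle
import time

# Hypothetical payload made only of built-in types, so marshal can handle it.
data = {"int": list(range(1000)), "str": [str(i) for i in range(1000)]}


def time_pickle(obj, protocol, runs=200):
    """Return (write seconds, read seconds, restored object) for one protocol."""
    start = time.time()
    for _ in range(runs):
        blob = pickle.dumps(obj, protocol=protocol)
    write_seconds = time.time() - start

    start = time.time()
    for _ in range(runs):
        restored = pickle.loads(blob)
    read_seconds = time.time() - start
    return write_seconds, read_seconds, restored


# Try every protocol this interpreter supports, from 0 up to the highest.
restored_by_protocol = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    write_s, read_s, restored = time_pickle(data, protocol)
    restored_by_protocol[protocol] = restored
    print("pickle protocol %d W %.3f R %.3f" % (protocol, write_s, read_s))

# marshal round trip for comparison (same interpreter on both sides).
marshal_blob = marshal.dumps(data)
marshal_restored = marshal.loads(marshal_blob)
print("marshal round trip ok: %s" % (marshal_restored == data))
```

Protocol 0 is the text-based format and is usually the slowest, so passing `protocol=pickle.HIGHEST_PROTOCOL` (or `-1`) when dumping is a cheap first step before switching formats.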