 

How to reduce the time taken to load a pickle file in Python

I have created a dictionary in Python and dumped it into a pickle file. Its size came to 300 MB. Now, I want to load the same pickle:

```python
import pickle

output = open('myfile.pkl', 'rb')
mydict = pickle.load(output)
output.close()
```

Loading this pickle takes around 15 seconds. How can I reduce this time?

Hardware specification: Ubuntu 14.04, 4 GB RAM.

The code below shows how long it takes to dump and load the file using json, pickle, and cPickle.

After dumping, the file size is around 300 MB.

```python
import json, pickle, cPickle
import os, timeit

mydict = {...}  # all values to be added

def dump_json():
    output = open('myfile1.json', 'wb')
    json.dump(mydict, output)
    output.close()

def dump_pickle():
    output = open('myfile2.pkl', 'wb')
    pickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def dump_cpickle():
    output = open('myfile3.pkl', 'wb')
    cPickle.dump(mydict, output, protocol=cPickle.HIGHEST_PROTOCOL)
    output.close()

def load_json():
    output = open('myfile1.json', 'rb')
    mydict = json.load(output)
    output.close()

def load_pickle():
    output = open('myfile2.pkl', 'rb')
    mydict = pickle.load(output)
    output.close()

def load_cpickle():
    output = open('myfile3.pkl', 'rb')
    # note: the original used pickle.load here, which times pure-Python
    # pickle instead of cPickle and skews the "cPickle load" result below
    mydict = cPickle.load(output)
    output.close()

if __name__ == '__main__':
    print "Json dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_json()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "Pickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_pickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "cPickle dump: "
    t = timeit.Timer(stmt="pickle_wr.dump_cpickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "Json load: "
    t = timeit.Timer(stmt="pickle_wr.load_json()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "pickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_pickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'

    print "cPickle load: "
    t = timeit.Timer(stmt="pickle_wr.load_cpickle()", setup="import pickle_wr")
    print t.timeit(1), '\n'
```

Output:

```
Json dump:     42.5809804916
Pickle dump:   52.87407804489
cPickle dump:  1.1903790187836
Json load:     12.240660209656
pickle load:   24.48748306274
cPickle load:  24.4888298893
```

I have seen that cPickle takes much less time to dump, but loading the file still takes a long time.
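For reference, the quickest change that usually helps is making sure both the dump and the load use a binary pickle protocol and the C implementation. A minimal sketch (the dictionary here is a small hypothetical stand-in for the 300 MB one; on Python 3 the C accelerator, the old cPickle, is used automatically):

```python
import os
import pickle
import tempfile

# Hypothetical small dictionary standing in for the real 300 MB one.
mydict = {i: str(i) for i in range(1000)}

path = os.path.join(tempfile.mkdtemp(), 'myfile.pkl')

# Binary protocols are much faster to read back than the old text
# default (protocol 0 on Python 2), so pass the highest one when dumping.
with open(path, 'wb') as f:
    pickle.dump(mydict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Loading; on Python 3, pickle.load already dispatches to the C extension.
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

Using `with` also guarantees the file is closed even if loading fails.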

asked Nov 11 '14 by iNikkz


1 Answer

Try using the json library instead of pickle. This should be an option in your case, because you're dealing with a dictionary, which is a relatively simple object.

According to this website,

JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Also see this question: What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?
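A minimal sketch of swapping pickle for json (with a hypothetical dictionary; note that json only handles basic types such as str, int, float, bool, None, lists, and dicts, and that non-string dict keys are coerced to strings on the way through):

```python
import json

# Hypothetical JSON-friendly dictionary: string keys, basic value types.
mydict = {"ids": [1, 2, 3], "name": "example", "score": 0.5}

blob = json.dumps(mydict)    # serialize to a str
restored = json.loads(blob)  # parse back into a dict
```

If your dictionary has integer keys or custom objects, this round trip will not reproduce it exactly, which is the main trade-off against pickle.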

Upgrading Python or using the marshal module with a fixed Python version also helps boost speed (code adapted from here):

```python
try:
    import cPickle
except:
    import pickle as cPickle
import pickle
import json, marshal, random
from time import time
from hashlib import md5

test_runs = 1000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle, cPickle, marshal]

    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            if module.__name__ in ['pickle', 'cPickle']:
                for i in range(test_runs):
                    serialized = module.dumps(data, protocol=-1)
            else:
                for i in range(test_runs):
                    serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
```

Results:

```
C:\Python27\python.exe -u "serialization_benchmark.py"
json    int    W 0.125 R 0.156
pickle  int    W 2.808 R 1.139
cPickle int    W 0.047 R 0.046
marshal int    W 0.016 R 0.031
json    float  W 1.981 R 0.624
pickle  float  W 2.607 R 1.092
cPickle float  W 0.063 R 0.062
marshal float  W 0.047 R 0.031
json    str    W 0.172 R 0.437
pickle  str    W 5.149 R 2.309
cPickle str    W 0.281 R 0.156
marshal str    W 0.109 R 0.047

C:\pypy-1.6\pypy-c -u "serialization_benchmark.py"
json    int    W 0.515 R 0.452
pickle  int    W 0.546 R 0.219
cPickle int    W 0.577 R 0.171
marshal int    W 0.032 R 0.031
json    float  W 2.390 R 1.341
pickle  float  W 0.656 R 0.436
cPickle float  W 0.593 R 0.406
marshal float  W 0.327 R 0.203
json    str    W 1.141 R 1.186
pickle  str    W 0.702 R 0.546
cPickle str    W 0.828 R 0.562
marshal str    W 0.265 R 0.078

c:\Python34\python -u "serialization_benchmark.py"
json    int    W 0.203 R 0.140
pickle  int    W 0.047 R 0.062
pickle  int    W 0.031 R 0.062
marshal int    W 0.031 R 0.047
json    float  W 1.935 R 0.749
pickle  float  W 0.047 R 0.062
pickle  float  W 0.047 R 0.062
marshal float  W 0.047 R 0.047
json    str    W 0.281 R 0.187
pickle  str    W 0.125 R 0.140
pickle  str    W 0.125 R 0.140
marshal str    W 0.094 R 0.078
```

Python 3.4 uses pickle protocol 3 by default, which gave no difference compared to protocol 4. Python 2 has protocol 2 as its highest pickle protocol (selected when a negative value is passed to dump), which is twice as slow as protocol 3.
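As a small illustration of the protocol point above, `protocol=-1` is shorthand for "the highest protocol this interpreter supports", so it picks protocol 2 on Python 2 and a newer protocol on Python 3:

```python
import pickle

data = list(range(1000))

# protocol=-1 and pickle.HIGHEST_PROTOCOL select the same protocol,
# so the serialized bytes come out identical.
highest = pickle.dumps(data, protocol=-1)
explicit = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Compare the payload size across all protocols the interpreter supports.
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    print("protocol %d: %d bytes" % (proto, len(blob)))
```

This is why hard-coding an old protocol number can silently cost you both speed and file size on a newer interpreter.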

answered Sep 25 '22 by twasbrillig