I'm implementing a program that needs to serialize and deserialize large objects, so I was running some tests with the pickle, cPickle and marshal modules to choose the best one. Along the way I found something very interesting: I'm using dumps and then loads (for each module) on a list of dicts, tuples, ints, floats and strings.

This is the output of my benchmark:
DUMPING a list of length 7340032
----------------------------------------------------------------------
pickle => 14.675 seconds
length of pickle serialized string: 31457430
cPickle => 2.619 seconds
length of cPickle serialized string: 31457457
marshal => 0.991 seconds
length of marshal serialized string: 117440540
LOADING a list of length: 7340032
----------------------------------------------------------------------
pickle => 13.768 seconds
(same length?) 7340032 == 7340032
cPickle => 2.038 seconds
(same length?) 7340032 == 7340032
marshal => 6.378 seconds
(same length?) 7340032 == 7340032
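A benchmark of this shape can be sketched roughly as follows. This is a minimal Python 3 version (where cPickle's C implementation was folded into pickle, so only pickle and marshal are compared); the sample data is made up and far smaller than the 7-million-item list above:

```python
import marshal
import pickle
import time

# Hypothetical sample data mixing dicts, tuples, ints, floats and strings,
# loosely mirroring the question's list (but only 1000 items, not 7 million).
data = [(i, float(i), str(i), {"key": i}) for i in range(1000)]

for name, mod in [("pickle", pickle), ("marshal", marshal)]:
    start = time.perf_counter()
    blob = mod.dumps(data)          # DUMPING
    dump_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = mod.loads(blob)      # LOADING
    load_time = time.perf_counter() - start

    assert restored == data         # round-trip sanity check
    print("%-7s dump %.3fs, load %.3fs, %d bytes"
          % (name, dump_time, load_time, len(blob)))
```

The absolute timings will of course differ from the numbers above; the interesting part is the relative dump/load times and the serialized sizes.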
So, from these results we can see that marshal was extremely fast in the dumping part of the benchmark: 14.8x faster than pickle and 2.6x faster than cPickle.

But, to my big surprise, marshal was by far slower than cPickle in the loading part: 2.2x faster than pickle, but 3.1x slower than cPickle.

As for RAM, marshal's performance while loading was also very inefficient. I'm guessing the reason why loading with marshal is so slow is somehow related to the length of its serialized string (much longer than that of pickle and cPickle).
My questions are:

Why does marshal dump faster but load slower than cPickle?
Why is the marshal serialized string so long?
Why is marshal's loading so inefficient in RAM?
Is there a way to improve marshal's loading performance?
Is there a way to combine marshal's fast dumping with cPickle's fast loading?
Difference between pickle and cPickle: pickle is a pure-Python, class-based implementation, while cPickle is written as C functions. As a result, cPickle is many times faster than pickle.

cPickle supports most elementary data types (e.g., dictionaries, lists, tuples, numbers, strings) and combinations thereof, as well as classes and instances. Pickling classes and instances saves only the data involved, not the code.
cPickle has a smarter algorithm than marshal and is able to do tricks to reduce the space used by large objects. That means it'll be slower to encode but faster to decode, as the resulting output is smaller. marshal is simplistic and serializes the object straight as-is, without analyzing it any further. That also answers why marshal's loading is so inefficient: it simply has to do more work, as in reading more data from disk, to do the same thing as cPickle.
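The size difference is easy to reproduce on Python 3 (where pickle is backed by the old cPickle C code): pickle's binary protocols use compact, variable-length encodings (a small integer takes only a couple of bytes), whereas marshal writes each value in a fixed, straightforward layout. A small sketch with made-up data:

```python
import marshal
import pickle

# A list of small integers, illustrative only.
data = list(range(1000))

p = pickle.dumps(data)   # binary pickle protocol: compact encodings
m = marshal.dumps(data)  # marshal: simple fixed-width layout per value

print("pickle: %d bytes, marshal: %d bytes" % (len(p), len(m)))
assert len(p) < len(m)   # pickle's output is noticeably smaller here
```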
marshal and cPickle are really different things in the end; you can't get both fast saving and fast loading, since fast saving implies analyzing the data structures less, which implies writing a lot more data to disk.

Regarding the fact that marshal might be incompatible with other versions of Python, you should generally use cPickle:

"This is not a general “persistence” module. For general persistence and transfer of Python objects through RPC calls, see the modules pickle and shelve. The marshal module exists mainly to support reading and writing the “pseudo-compiled” code for Python modules of .pyc files. Therefore, the Python maintainers reserve the right to modify the marshal format in backward incompatible ways should the need arise. If you’re serializing and de-serializing Python objects, use the pickle module instead – the performance is comparable, version independence is guaranteed, and pickle supports a substantially wider range of objects than marshal." (the Python docs on marshal)
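As a practical footnote to that recommendation: on Python 3 the cPickle implementation became the default backend of pickle, and the choice of pickle protocol itself matters a lot for both size and speed. A hedged sketch (the data is made up):

```python
import pickle

# Illustrative data only.
data = [{"a": i, "b": float(i)} for i in range(1000)]

# Protocol 0 is the old ASCII format; HIGHEST_PROTOCOL is the most
# recent binary format supported by the running interpreter.
old = pickle.dumps(data, protocol=0)
new = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print("protocol 0: %d bytes, highest: %d bytes" % (len(old), len(new)))
assert pickle.loads(new) == data
assert len(new) < len(old)  # the binary protocol is considerably more compact
```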