<p><code>msgpack</code> in Pandas is supposed to be a replacement for <code>pickle</code>.</p>
<p>Per the Pandas docs on msgpack:</p>
<blockquote> <p>This is a lightweight portable binary format, similar to binary JSON, that is highly space efficient, and provides good performance both on the writing (serialization), and reading (deserialization).</p> </blockquote>
<p>I find, however, that its performance does not stack up against pickle.</p>
<pre class="prettyprint"><code>>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10000, 100))
>>> %timeit df.to_pickle('test.p')
10 loops, best of 3: 22.4 ms per loop
>>> %timeit df.to_msgpack('test.msg')
10 loops, best of 3: 36.4 ms per loop
>>> %timeit pd.read_pickle('test.p')
100 loops, best of 3: 10.5 ms per loop
>>> %timeit pd.read_msgpack('test.msg')
10 loops, best of 3: 24.6 ms per loop
</code></pre>
<p><strong>Question:</strong> Aside from potential security issues with pickle, what are the benefits of msgpack over pickle? Is pickle still the preferred method of serializing data, or do better alternatives currently exist?</p>

<h3>Pickle is better for the following:</h3>
<ol>
<li>Numerical data or anything that uses the buffer protocol, such as numpy arrays (though only if you pass a somewhat recent <code>protocol=</code>).</li>
<li>Python-specific objects like classes, functions, etc. (although here you should look at <code>cloudpickle</code>).</li>
</ol>
<h3>MsgPack is better for the following:</h3>
<ol>
<li>Cross-language interoperation. It's an alternative to JSON with some improvements.</li>
<li>Performance on text data and Python objects. It's a decent factor faster than pickle at this under any setting.</li>
</ol>
<p>As @Jeff noted above, this blogpost may be of interest.</p>
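To illustrate the point about `protocol=`, here is a minimal sketch using only the standard library. It assumes an `array.array` of doubles is an acceptable stand-in for a numpy array (that substitution is mine, not part of the answer above), and it compares payload sizes rather than timings. The names `values` and `sizes` are made up for the example.

```python
import pickle
import random
from array import array

# Stand-in for numerical data: 10,000 C doubles in one contiguous buffer.
# (Assumption: array.array is used here instead of a numpy array so the
# sketch runs without third-party packages.)
values = array("d", (random.random() for _ in range(10_000)))

sizes = {}
for protocol in (0, 2, pickle.HIGHEST_PROTOCOL):
    payload = pickle.dumps(values, protocol=protocol)
    # Every protocol must round-trip the data exactly.
    assert pickle.loads(payload) == values
    sizes[protocol] = len(payload)
    print(f"protocol {protocol}: {len(payload)} bytes")
```

On a typical CPython build the oldest protocol should produce a much larger payload than the newest one, because newer protocols can store the raw bytes of the buffer instead of a text representation of each number. Exact sizes and any speed difference will vary by Python version and data, so it is worth measuring on your own DataFrames.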

Pandas msgpack vs pickle


asked Jun 04 '15 18:06

Alexander

1 Answer


answered Nov 02 '22 18:11

MRocklin
