
Pandas msgpack vs pickle


msgpack in Pandas is supposed to be a replacement for pickle.

Per the Pandas docs on msgpack:

This is a lightweight portable binary format, similar to binary JSON, that is highly space efficient, and provides good performance both on the writing (serialization), and reading (deserialization).

I find, however, that its performance does not appear to stack up against pickle.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10000, 100))

    >>> %timeit df.to_pickle('test.p')
    10 loops, best of 3: 22.4 ms per loop

    >>> %timeit df.to_msgpack('test.msg')
    10 loops, best of 3: 36.4 ms per loop

    >>> %timeit pd.read_pickle('test.p')
    100 loops, best of 3: 10.5 ms per loop

    >>> %timeit pd.read_msgpack('test.msg')
    10 loops, best of 3: 24.6 ms per loop

Question: Aside from potential security issues with pickle, what are the benefits of msgpack over pickle? Is pickle still the preferred method of serializing data, or do better alternatives currently exist?

asked Jun 04 '15 18:06 by Alexander




1 Answer

Pickle is better for the following:

  1. Numerical data or anything that uses the buffer protocol (such as numpy arrays), though only if you use a reasonably recent pickle protocol (the protocol= argument).
  2. Python-specific objects like classes, functions, etc. (although here you should look at cloudpickle).
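
The remark about the protocol= argument can be checked with a small experiment. The sketch below is an illustration, not part of the original answer: it uses the standard library's array.array as a stand-in for a numpy array (an assumption made so that nothing needs to be installed) and compares the pickled size under every available protocol.

```python
import array
import pickle
import random

# Stand-in for a numeric buffer; a numpy array is the more typical case.
random.seed(0)
values = array.array("d", (random.random() for _ in range(100_000)))
raw_bytes = len(values) * values.itemsize

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(values, protocol=protocol)
    # Every protocol must reproduce the original data exactly.
    assert pickle.loads(blob) == values
    sizes[protocol] = len(blob)

for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size / raw_bytes:.2f}x the raw buffer size")
```

On a typical CPython build the older protocols serialize the values one by one, while the newer ones write the underlying buffer in a single chunk, so the output is close to the raw size. Exact ratios depend on the Python version.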

MsgPack is better for the following:

  1. Cross-language interoperation. It's an alternative to JSON with some improvements.
  2. Performance on text data and Python objects. It is faster than pickle at this by a decent factor, whatever pickle settings you use.
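
To measure the text-data claim on your own machine, a rough harness along these lines may help. It is a sketch under assumptions: the records list is invented sample data, msgpack refers to the third-party msgpack package (pip install msgpack) rather than the pandas wrapper, and the msgpack case is skipped if the package is not installed. JSON is included only as a familiar cross-language baseline.

```python
import json
import pickle
import timeit

try:
    import msgpack  # third-party package, optional for this sketch
except ImportError:
    msgpack = None

# Invented sample data: text-heavy plain Python objects.
records = [
    {"id": i, "name": f"user-{i}", "tags": ["alpha", "beta", "gamma"]}
    for i in range(10_000)
]

# Each entry maps a name to a (serialize, deserialize) pair.
candidates = {
    "pickle": (
        lambda: pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL),
        pickle.loads,
    ),
    "json": (
        lambda: json.dumps(records).encode("utf-8"),
        json.loads,
    ),
}
if msgpack is not None:
    candidates["msgpack"] = (
        lambda: msgpack.packb(records, use_bin_type=True),
        lambda blob: msgpack.unpackb(blob, raw=False),
    )

results = {}
for name, (dump, load) in candidates.items():
    blob = dump()
    assert load(blob) == records  # round trip must preserve the data
    seconds = timeit.timeit(dump, number=20) / 20
    results[name] = (len(blob), seconds)
    print(f"{name:8s} {len(blob):>9d} bytes  {seconds * 1e3:.2f} ms per dump")
```

Results vary with library versions and data shape, so treat the numbers as a local measurement rather than a general ranking.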

As @Jeff noted above, this blogpost may be of interest.

answered Nov 02 '22 18:11 by MRocklin