msgpack in Pandas is supposed to be a replacement for pickle.
Per the Pandas docs on msgpack:
This is a lightweight portable binary format, similar to binary JSON, that is highly space efficient, and provides good performance both on the writing (serialization), and reading (deserialization).
I find, however, that its performance does not appear to stack up against pickle.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10000, 100))
>>> %timeit df.to_pickle('test.p')
10 loops, best of 3: 22.4 ms per loop
>>> %timeit df.to_msgpack('test.msg')
10 loops, best of 3: 36.4 ms per loop
>>> %timeit pd.read_pickle('test.p')
100 loops, best of 3: 10.5 ms per loop
>>> %timeit pd.read_msgpack('test.msg')
10 loops, best of 3: 24.6 ms per loop
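For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard `timeit` module. The helper `time_roundtrip` and the temporary file names are my own, not part of Pandas. Note that `to_msgpack`/`read_msgpack` were deprecated in pandas 0.25 and removed in 1.0, so the msgpack branch is guarded and will only run on older installs.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd


def time_roundtrip(df, writer, reader, path, repeats=3):
    """Return best-of-`repeats` (write, read) times in seconds (hypothetical helper)."""
    write_s = min(timeit.repeat(lambda: writer(df, path), number=1, repeat=repeats))
    read_s = min(timeit.repeat(lambda: reader(path), number=1, repeat=repeats))
    return write_s, read_s


df = pd.DataFrame(np.random.randn(10000, 100))

with tempfile.TemporaryDirectory() as tmp:
    results = {
        "pickle": time_roundtrip(
            df, pd.DataFrame.to_pickle, pd.read_pickle, os.path.join(tmp, "test.p")
        )
    }
    # to_msgpack/read_msgpack only exist in older pandas releases, so guard the call.
    if hasattr(df, "to_msgpack") and hasattr(pd, "read_msgpack"):
        results["msgpack"] = time_roundtrip(
            df, pd.DataFrame.to_msgpack, pd.read_msgpack, os.path.join(tmp, "test.msg")
        )

for name, (write_s, read_s) in results.items():
    print(f"{name}: write {write_s * 1e3:.1f} ms, read {read_s * 1e3:.1f} ms")
```

Absolute timings will depend on the machine, the disk, and the pandas version, so treat the output as a relative comparison only.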
Question: Aside from potential security issues with pickle, what are the benefits of msgpack over pickle? Is pickle still the preferred method of serializing data, or do better alternatives currently exist?
On write speed, pickle was 30x faster than CSV, msgpack and Parquet were 10x faster, and JSON and HDF were about the same as CSV. On storage space, gzipped Parquet gave a 40x reduction and gzipped CSV gave a 10x reduction (I didn't compare the rest).
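A comparison like this can be reproduced with a small script. The sketch below is only an illustration: `write_and_measure` is a hypothetical helper, the frame is random data rather than the data set used above, and it covers only CSV, gzipped CSV, and pickle because those need nothing beyond pandas itself (Parquet and HDF need extra packages).

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def write_and_measure(write, path):
    """Call write(path) and return (seconds elapsed, bytes on disk)."""
    start = time.perf_counter()
    write(path)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(path)


df = pd.DataFrame(np.random.randn(2000, 50))

with tempfile.TemporaryDirectory() as tmp:
    measurements = {
        "csv": write_and_measure(
            lambda p: df.to_csv(p), os.path.join(tmp, "t.csv")
        ),
        "csv.gz": write_and_measure(
            lambda p: df.to_csv(p, compression="gzip"), os.path.join(tmp, "t.csv.gz")
        ),
        "pickle": write_and_measure(
            lambda p: df.to_pickle(p), os.path.join(tmp, "t.pkl")
        ),
    }

for name, (seconds, size) in measurements.items():
    print(f"{name}: {seconds * 1e3:.1f} ms, {size / 1024:.0f} KiB")
```

The ratios you get will vary a lot with the data: random floats compress far less well than repetitive or low-cardinality columns.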
Pickle is a serialized way of storing a Pandas dataframe. Basically, you are writing the exact in-memory representation of the dataframe to disk. This means the column types and the indices come back exactly as they were. If you simply save the dataframe as CSV, you are just storing it as comma-separated text, so that type information is lost.
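To make the difference concrete, here is a small sketch that round-trips the same frame through pickle and through CSV. The column names and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        "label": pd.Categorical(["a", "b"]),
        "value": [1.5, 2.5],
    },
    index=pd.Index([10, 20], name="id"),
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "frame.pkl")
    csv_path = os.path.join(tmp, "frame.csv")

    # Pickle stores the frame as-is, including dtypes and the index.
    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

    # CSV stores plain text; dtypes have to be re-inferred on read.
    df.to_csv(csv_path)
    from_csv = pd.read_csv(csv_path, index_col="id")

print("original:\n", df.dtypes, sep="")
print("after pickle:\n", from_pickle.dtypes, sep="")
print("after csv:\n", from_csv.dtypes, sep="")
```

After the CSV round trip the datetime and categorical columns come back as plain strings unless you pass extra arguments such as `parse_dates` or `dtype` to `read_csv`.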
Pandas DataFrame: to_pickle() function
The to_pickle() function is used to pickle (serialize) an object to a file.
path: File path where the pickled object will be stored.
compression: A string representing the compression to use in the output file. By default, it is inferred from the file extension of the specified path.
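As a quick illustration of the compression behaviour, the sketch below writes the same frame three ways: uncompressed, with gzip inferred from a `.gz` extension, and with gzip requested explicitly. The file names are arbitrary, and the all-zeros frame is chosen only because it compresses well.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((1000, 10)))

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "frame.pkl")
    gz_path = os.path.join(tmp, "frame.pkl.gz")
    explicit_path = os.path.join(tmp, "frame.bin")

    df.to_pickle(raw_path)                           # no compression
    df.to_pickle(gz_path)                            # gzip inferred from ".gz"
    df.to_pickle(explicit_path, compression="gzip")  # gzip requested explicitly

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # A gzip stream starts with the two magic bytes 0x1f 0x8b.
    with open(gz_path, "rb") as fh:
        magic = fh.read(2)

    # read_pickle infers compression from the extension the same way.
    restored = pd.read_pickle(gz_path)
    # ".bin" gives no hint, so the compression has to be named on read.
    restored_explicit = pd.read_pickle(explicit_path, compression="gzip")

print(f"uncompressed: {raw_size} bytes, gzip: {gz_size} bytes")
```

When the extension does not reveal the compression, as with `frame.bin` here, the same `compression` value has to be passed to `read_pickle` as well.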
As @Jeff noted above, this blogpost may be of interest.