Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to specify the pickle protocol when writing pandas to HDF5?

Is there a way to tell Pandas to use a specific pickle protocol (e.g. 4) when writing an HDF5 file?

Here is the situation (much simplified):

  • Client A is using python=3.8.1 (as well as pandas=1.0.0 and pytables=3.6.1). A writes some DataFrame using df.to_hdf(file, key).

  • Client B is using python=3.7.1 (and, as it happened, pandas=0.25.1 and pytables=3.5.2 --but that's irrelevant). B tries to read the data written by A using pd.read_hdf(file, key), and fails with ValueError: unsupported pickle protocol: 5.

Mind you, this doesn't happen with a purely numerical DataFrame (e.g. pd.DataFrame(np.random.normal(size=(10,10))). So here is a reproducible example:

(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world']))
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>

Note: I tried also reading using pandas=1.0.0 (and pytables=3.6.1) in python=3.7.4. That fails too, so I believe it is simply the Python version (3.8 writer vs 3.7 reader) that causes the problem. This makes sense since pickle protocol 5 was introduced as PEP-574 for Python 3.8.

like image 607
Pierre D Avatar asked Feb 05 '20 01:02

Pierre D


People also ask

Can a pandas DataFrame be pickled?

Pandas DataFrame: to_pickle() functionThe to_pickle() function is used to pickle (serialize) object to file. File path where the pickled object will be stored. A string representing the compression to use in the output file.

Can pandas read HDF5?

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format.

What is Panda pickle?

Pickle is a serialized way of storing a Pandas dataframe.

Which of the following file format does pandas support for data import?

JSON Files json extension. Python and Pandas work well with JSON files, as Python's json library offers built-in support for them.


1 Answers

PyTable uses the highest protocol by default, which is hardcoded here: https://github.com/PyTables/PyTables/blob/50dc721ab50b56e494a5657e9c8da71776e9f358/tables/atom.py#L1216

As a workaround, you can monkey-patch the pickle module on the client A who writes a HDF file. You should do that before importing pandas:

import pickle
pickle.HIGHEST_PROTOCOL = 4
import pandas

df.to_hdf(file, key)

Now the HDF file has been created using pickle protocol version 4 instead version 5.

like image 88
Piotr Jurkiewicz Avatar answered Sep 19 '22 13:09

Piotr Jurkiewicz