
HDF5 string serialization details in pandas?

Tags:

pandas

saddle

I am the author of Saddle (saddle.github.io), which provides functionality similar in spirit to pandas (but in Scala on the JVM). I'm trying to ensure that the HDF5 serialization format of pandas' DataFrame is interoperable with that of Saddle. I'm currently implementing string array serialization in Saddle. So my question is how the pandas DataFrame serializes strings. If I create an HDF5 file in pandas as follows:

from pandas import *
h = HDFStore('tmp.h5')
f = DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})
h.put("f1", f)
h.close()

And h5dump the resulting tmp.h5 file, I see that the string block (block2_values) is stored as datatype H5T_VLEN and attribute

 ATTRIBUTE "CLASS" {
    DATATYPE  H5T_STRING {
          STRSIZE 8;
          STRPAD H5T_STR_NULLTERM;
          CSET H5T_CSET_ASCII;
          CTYPE H5T_C_S1;
       }
    DATASPACE  SCALAR
    DATA {
    (0): "VLARRAY"
    }
 }

This hints at an ASCII character set; however, the bytes I see encoded do not seem to correspond to ASCII (i.e., "a", "b", "c"). Also, I'm curious where STRSIZE 8 comes from. Can anyone shed light on the implementation details of the string serialization that occurs via pandas -> pytables -> hdf5? (I'd also be happy with any pointers to code in pandas/pytables where I can start digging deeper myself :)

Adam Klein asked Jun 11 '13 20:06



1 Answer

You picked an example that on the surface seems very simple, but is actually fairly complicated behind the scenes. This ends up storing 3 different blocks of data (1 for each dtype), and each of these stores an index and the data.

The object which you stored is what I call a Storer format, meaning the numpy arrays are written all at once, so once written they are not changeable. See docs here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables

PyTables docs are here: http://pytables.github.io/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants

These strings unfortunately are stored as a python pickle in this particular format of storage, so I don't know if you can decode them cross-platform.
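To illustrate why the bytes in the dump do not look like plain "a", "b", "c", here is a minimal stdlib-only sketch. It assumes the object block is written as a Python pickle, as described above; the plain list and the choice of pickle protocol 2 are stand-ins for illustration, not what pandas/PyTables necessarily use internally. The last lines also check one reading of the STRSIZE 8 in the question's attribute dump: it is consistent with the 7 characters of "VLARRAY" plus a null terminator, i.e. it describes the CLASS attribute itself rather than the stored strings.

```python
import pickle

# Stand-in for the string column (assumption: pandas holds these in a
# numpy object array; a plain list is used here to stay stdlib-only).
values = ["a", "b", "c"]

# Serialize the values as an opaque pickle, the way an object-typed
# variable-length array would carry them (protocol 2 is an assumption).
payload = pickle.dumps(values, protocol=2)

# The payload wraps the characters in pickle opcodes, so a reader that
# expects raw ASCII text will not find b"abc" on its own.
is_plain_text = payload == b"abc"

# Only a pickle-aware reader gets the original values back.
restored = pickle.loads(payload)

# The CLASS attribute value "VLARRAY" is 7 characters; with a null
# terminator (STRPAD H5T_STR_NULLTERM) that accounts for 8 bytes.
class_attr_size = len("VLARRAY") + 1
```

A JVM reader would therefore need a pickle decoder to handle this layout, which is why the Table format discussed next is the easier interoperability target.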

You will have an easier time reading a Table object, which is stored using more basic types that are easily exported (there is a section in the docs on exporting to R, for example).

Try reading this format:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})

In [4]: h = pd.HDFStore('tmp.h5')

In [6]: h.put('df',df, table=True)

In [7]: h.close()

Using the PyTables ptdump -avd tmp.h5 utility, this yields the following. If you are using PyTables 3.0.0 (which just came out) or py3 (which we are going to support in 0.11.1), then strings are all utf-8 encoded and written as bytes. Prior to PyTables 3.0.0, strings are written as ascii, I believe.

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    index_cols := [(0, 'index')],
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := b"(lp1\n(I1\n(lp2\ncnumpy.core.multiarray\nscalar\np3\n(cnumpy\ndtype\np4\n(S'i8'\nI0\nI1\ntRp5\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp6\nag3\n(g5\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp7\nag3\n(g5\nS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp8\natp9\na.",
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/df/table (Table(3,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=1, shape=(1,), dflt=b'', pos=3)}
  byteorder := 'little'
  chunkshape := (2621,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := b'',
    FIELD_3_NAME := 'values_block_2',
    NROWS := 3,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
    values_block_1_dtype := b'int64',
    values_block_1_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
    values_block_2_dtype := b'string8',
    values_block_2_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na."]
  Data dump:
[0] (0, [1.5], [1], [b'a'])
[1] (1, [2.5], [2], [b'b'])
[2] (2, [3.5], [3], [b'c'])
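
The table description above maps onto a simple fixed-width record, which is what makes this format easier to read from another platform. The sketch below models one row with Python's struct module. It assumes a packed, little-endian layout in the field order shown by the dump (int64 index, float64, int64, 1-byte string) and null padding for fixed-width strings; it models only the logical record, not HDF5's on-disk chunking or compression.

```python
import struct

# Assumed packed little-endian record, in the order given by the dump:
# index (int64), values_block_0 (float64), values_block_1 (int64),
# values_block_2 (fixed-width string, itemsize=1).
ROW = struct.Struct("<qdq1s")

# The three rows from the data dump, packed into one byte buffer.
rows = [(0, 1.5, 1, b"a"), (1, 2.5, 2, b"b"), (2, 3.5, 3, b"c")]
raw = b"".join(ROW.pack(*row) for row in rows)

# Decode each record; strip any null padding from the fixed-width field
# and interpret the remaining bytes as UTF-8 (ASCII is a subset of it).
decoded = []
for index, float_val, int_val, text in ROW.iter_unpack(raw):
    decoded.append((index, float_val, int_val, text.rstrip(b"\x00").decode("utf-8")))
```

Decoding the string field as UTF-8 covers both cases mentioned above, since ASCII-only data is byte-for-byte identical under either encoding.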

Probably best to contact me off-line to discuss further.

Jeff answered Sep 24 '22 01:09
