I am the author of Saddle (saddle.github.io), which provides functionality similar in spirit to pandas (but in Scala on the JVM). I'm trying to ensure that the HDF5 serialization format of pandas' DataFrame is interoperable with that of Saddle. I'm currently implementing string array serialization in Saddle. So my question is how the pandas DataFrame serializes strings. If I create an HDF5 file in pandas as follows:
from pandas import *
h = HDFStore('tmp.h5')
f = DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})
h.put("f1", f)
h.close()
and then h5dump the resulting tmp.h5 file, I see that the string block (block2_values) is stored as datatype H5T_VLEN with the following attribute:
ATTRIBUTE "CLASS" {
DATATYPE H5T_STRING {
STRSIZE 8;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "VLARRAY"
}
}
This hints at an ASCII character set; however, the bytes I see encoded do not seem to correspond to ASCII (ie, "a", "b", "c"). Also, I'm curious where STRSIZE 8 comes from. Can anyone shed light on the implementation details of string serialization which occurs via pandas -> pytables -> hdf5? (I'd also be happy with any pointers to code in pandas/pytables where I can start digging deeper myself :)
You picked an example that on the surface seems very simple, but is actually fairly complicated behind the scenes. This ends up storing 3 different blocks of data (1 for each dtype), and each of these stores an index and the data.
The object which you stored is what I call a Storer format, meaning the numpy arrays are written all at once, so once written they are not changeable. See docs here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables
PyTables docs are here: http://pytables.github.io/usersguide/libref/declarative_classes.html#the-atom-class-and-its-descendants
These strings unfortunately are stored as a python pickle in this particular format of storage, so I don't know if you can decode them cross-platform.
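To see why the bytes in the dump do not look like plain "a", "b", "c", here is a minimal sketch using only Python's standard pickle module. It assumes the stored payload is an ordinary pickle; the protocol number and the exact object that pandas/PyTables pickles are assumptions for illustration, not taken from the file above.

```python
import pickle

# Assumption: the string block is pickled as a plain Python list.
values = ["a", "b", "c"]

# Protocol 2 is chosen only for illustration; the real file may differ.
payload = pickle.dumps(values, protocol=2)

# A protocol-2 pickle starts with the PROTO opcode (0x80) and the version byte,
# so the raw bytes are not simply the ASCII characters of the strings.
header = payload[:2]

# Only a pickle-aware reader can turn the payload back into the strings.
restored = pickle.loads(payload)
print(header, restored)
```

A reader on the JVM would need its own pickle decoder to interpret a payload like this, which is why the Table format is the easier target.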
You will have an easier time reading a Table object, which is stored using more basic types that are easily exported (there is a section in the docs on exporting to R, for example).
Try reading this format:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({0: [1,2,3], 1: ["a", "b", "c"], 2: [1.5, 2.5, 3.5]})
In [4]: h = pd.HDFStore('tmp.h5')
In [6]: h.put('df', df, table=True)  # table format; newer pandas spells this format='table'
In [7]: h.close()
If you are on PyTables >= 3.0.0 (which just came out), or on Python 3 (which we are going to support in 0.11.1), then strings are all UTF-8 encoded and written as bytes. Prior to PyTables 3.0.0, strings are written as ASCII, I believe.
Using the PyTables ptdump -avd tmp.h5 utility, this yields the following:
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.0',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 12 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
index_cols := [(0, 'index')],
levels := 1,
nan_rep := b'nan',
non_index_axes := b"(lp1\n(I1\n(lp2\ncnumpy.core.multiarray\nscalar\np3\n(cnumpy\ndtype\np4\n(S'i8'\nI0\nI1\ntRp5\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp6\nag3\n(g5\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp7\nag3\n(g5\nS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp8\natp9\na.",
pandas_type := b'frame_table',
pandas_version := b'0.10.1',
table_type := b'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/df/table (Table(3,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
"values_block_2": StringCol(itemsize=1, shape=(1,), dflt=b'', pos=3)}
byteorder := 'little'
chunkshape := (2621,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 19 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0,
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := b'',
FIELD_3_NAME := 'values_block_2',
NROWS := 3,
TITLE := '',
VERSION := '2.6',
index_kind := b'integer',
values_block_0_dtype := b'float64',
values_block_0_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
values_block_1_dtype := b'int64',
values_block_1_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na.",
values_block_2_dtype := b'string8',
values_block_2_kind := b"(lp1\ncnumpy.core.multiarray\nscalar\np2\n(cnumpy\ndtype\np3\n(S'i8'\nI0\nI1\ntRp4\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp5\na."]
Data dump:
[0] (0, [1.5], [1], [b'a'])
[1] (1, [2.5], [2], [b'b'])
[2] (2, [3.5], [3], [b'c'])
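The StringCol(itemsize=1) column above holds fixed-width byte strings. Here is a minimal sketch of how a reader might decode such a field, assuming the raw buffer is a concatenation of itemsize-wide cells padded with trailing NUL bytes and encoded as UTF-8. The function name decode_fixed_width is hypothetical, and the padding convention is an assumption to verify against the actual file.

```python
def decode_fixed_width(raw: bytes, itemsize: int, encoding: str = "utf-8") -> list:
    """Split a buffer into itemsize-wide cells and decode each one."""
    if itemsize <= 0 or len(raw) % itemsize != 0:
        raise ValueError("buffer length must be a positive multiple of itemsize")
    cells = (raw[i:i + itemsize] for i in range(0, len(raw), itemsize))
    # Strip trailing NUL padding before decoding (assumed padding convention).
    return [cell.rstrip(b"\x00").decode(encoding) for cell in cells]

# Cells of width 1, as in the dump above.
print(decode_fixed_width(b"abc", itemsize=1))
# Wider cells, where shorter strings carry NUL padding.
print(decode_fixed_width(b"ab\x00c\x00\x00", itemsize=3))
```

The same logic ports directly to Scala: slice the byte array by itemsize, drop trailing zero bytes, and decode with the chosen charset.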
Probably best to contact me off-line to discuss further.