
Save pandas DataFrame using h5py for interoperability with other hdf5 readers

Here is a sample data frame:

import pandas as pd

NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I know that pandas has the PyTables-based HDFStore, which is an easy way to efficiently serialize and deserialize a data frame. But those datasets are not easy to load directly with another reader such as h5py or MATLAB. How can I save a data frame using h5py, so that I can easily load it back using another hdf5 reader?

Phil asked Jun 11 '15 06:06


2 Answers

Here is my approach to solving this problem. I am hoping either someone else has a better solution or my approach is helpful to others.

First, define a function that makes a numpy structured array (not a record array) from a pandas DataFrame.

import numpy as np
def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_records(index=False))

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """

    v = df.values
    cols = df.columns
    # Field names must be str under Python 3; encoding to bytes was only needed on Python 2.
    types = [(str(k), df[k].dtype.type) for k in cols]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z

Use reset_index to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.

sa = df_to_sarray(df.reset_index())
sa

array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)], 
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Save that structured array to an hdf5 file.

import h5py
with h5py.File('mydata.h5', 'w') as hf:
    hf['df'] = sa

Load the dataset back from the HDF5 file.

with h5py.File('mydata.h5', 'r') as hf:
    sa2 = hf['df'][:]
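
Because the data frame is stored as a single compound dataset, a reader does not have to load every column. The following is a minimal sketch, assuming a small stand-in array (`sa_demo`) and a temporary file name that are not part of the original answer; it lists the field names and reads only one field through h5py's field-name indexing.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for the structured array `sa` built earlier.
sa_demo = np.zeros(3, dtype=[("ID", "<i8"), ("A", "<f8")])
sa_demo["ID"] = [1, 2, 3]
sa_demo["A"] = [0.1, 0.2, 0.3]

# Write it to a throwaway file (the path is an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as hf:
    hf["df"] = sa_demo

with h5py.File(path, "r") as hf:
    dset = hf["df"]
    field_names = dset.dtype.names  # names of the compound fields
    col_a = dset["A"]               # reads only the 'A' field from the file

print(field_names)
print(col_a)
```

Other HDF5 readers generally expose the same compound fields by name, which is what makes this layout convenient outside of pandas.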

Extract the ID column and drop it from sa2.

import numpy.lib.recfunctions as nprec
ID = sa2['ID']
sa2 = nprec.drop_fields(sa2, 'ID')

Build a data frame from sa2, using ID as the index.

df2 = pd.DataFrame(sa2, index=ID)
df2.index.name = 'ID'

print(df2)

      A    B    C
ID               
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
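
The steps above can be folded into a pair of helpers. This is only a sketch under stated assumptions: the names `save_frame` and `load_frame`, the default dataset name, and the temporary path are hypothetical, and it handles numeric columns only (object or string columns would need an explicit fixed-width or h5py string dtype). It uses `pd.DataFrame.from_records` to restore the index instead of `drop_fields`, which is a substitution for brevity rather than the approach shown above.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def save_frame(df, path, name="df"):
    """Write df, index included, as one compound HDF5 dataset."""
    flat = df.reset_index()  # move the index into an ordinary column
    dtype = np.dtype([(str(c), flat[c].dtype) for c in flat.columns])
    sa = np.zeros(len(flat), dtype=dtype)
    for c in flat.columns:
        sa[str(c)] = flat[c].to_numpy()
    with h5py.File(path, "w") as hf:
        hf.create_dataset(name, data=sa)


def load_frame(path, name="df", index_col="ID"):
    """Read the compound dataset back and restore index_col as the index."""
    with h5py.File(path, "r") as hf:
        sa = hf[name][:]
    return pd.DataFrame.from_records(sa, index=index_col)


# Small round trip on made-up data.
demo = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan]},
    index=pd.Index([1, 2], name="ID"),
)
demo_path = os.path.join(tempfile.mkdtemp(), "roundtrip.h5")
save_frame(demo, demo_path)
restored = load_frame(demo_path)
print(restored)
```
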
Phil answered Oct 24 '22 01:10


The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the metadata. It is described in the HDFStore section of the pandas I/O documentation.

In [54]: df.to_hdf('test.h5', key='df', mode='w', format='table', data_columns=True)

In [55]: h = h5py.File('test.h5', 'r')

In [56]: h['df']['table']
Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">

In [64]: h['df']['table'][:]
Out[64]: 
array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)], 
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])


In [57]: h['df']['table'].attrs.items()
Out[57]: 
[(u'CLASS', 'TABLE'),
 (u'VERSION', '2.7'),
 (u'TITLE', ''),
 (u'FIELD_0_NAME', 'index'),
 (u'FIELD_1_NAME', 'A'),
 (u'FIELD_2_NAME', 'B'),
 (u'FIELD_3_NAME', 'C'),
 (u'FIELD_0_FILL', 0),
 (u'FIELD_1_FILL', 0.0),
 (u'FIELD_2_FILL', 0.0),
 (u'FIELD_3_FILL', 0.0),
 (u'index_kind', 'integer'),
 (u'A_kind', "(lp1\nS'A'\na."),
 (u'A_meta', 'N.'),
 (u'A_dtype', 'float64'),
 (u'B_kind', "(lp1\nS'B'\na."),
 (u'B_meta', 'N.'),
 (u'B_dtype', 'float64'),
 (u'C_kind', "(lp1\nS'C'\na."),
 (u'C_meta', 'N.'),
 (u'C_dtype', 'float64'),
 (u'NROWS', 7)]

In [58]: h.close()

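The data will be completely readable in any HDF5 reader. Some of the metadata is pickled (for example the A_kind and A_meta attributes above), so readers outside Python should ignore those attributes or handle them with care.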
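
As an illustration of reading such a file without pandas' HDFStore, here is a minimal sketch. The function name `read_pandas_table` and its parameters are hypothetical. It assumes the file was written with format='table' and data_columns=True as in the session above, so that each column is its own numeric field of the compound dataset, and it skips the pickled attributes entirely. The index comes back under the stored field name ('index'), not the original index name.

```python
import h5py
import pandas as pd


def read_pandas_table(path, key="df", index_field="index"):
    """Rebuild a DataFrame from a pandas 'table' node using only h5py.

    Assumes every column is a separate numeric field of <key>/table,
    which is the layout shown for data_columns=True.
    """
    with h5py.File(path, "r") as h:
        rec = h[key]["table"][:]  # whole compound dataset as a structured array
    # Use the stored index field as the DataFrame index.
    return pd.DataFrame.from_records(rec, index=index_field)
```

With the file from the session above, `read_pandas_table('test.h5')` would be the intended call. A file written without data_columns=True groups values into block fields and would need different handling.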

Jeff answered Oct 24 '22 01:10