 

Unable to save DataFrame to HDF5 ("object header message is too large")

I have a DataFrame in Pandas:

In [7]: my_df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 34 entries, 0 to 0
Columns: 2661 entries, airplane to zoo
dtypes: float64(2659), object(2)

When I try to save this to disk:

store = pd.HDFStore(p_full_h5)
store.append('my_df', my_df)

I get:

  File "H5A.c", line 254, in H5Acreate2
    unable to create attribute
  File "H5A.c", line 503, in H5A_create
    unable to create attribute in object header
  File "H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node:
 /my_df(Group) u''.

Why?

Note: In case it matters, the DataFrame column names are short, simple strings:

In [12]: max([len(x) for x in list(my_df.columns)])
Out[12]: 47

This is all with Pandas 0.11 and the latest stable versions of IPython, Python, and HDF5.

asked May 19 '13 by Amelio Vazquez-Reina

3 Answers

HDF5 has a 64kb header limit for all of the columns' metadata. This includes names, dtypes, and so on. Once you get to roughly 2000 columns, you run out of space to store all the metadata. This is a fundamental limitation of PyTables, and I don't think they will work around it on their side any time soon. You will either have to split the table up or choose another storage format.
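
For instance, a minimal sketch of the splitting approach (the chunk size, key names and helper name are just illustrative, not something prescribed here):

import pandas as pd

def store_in_chunks(path, df, chunk_size=1000):
    """Write df to an HDFStore in column chunks of at most chunk_size columns."""
    # Ceiling division: number of column chunks needed.
    n_chunks = (len(df.columns) + chunk_size - 1) // chunk_size
    with pd.HDFStore(path) as store:
        for i in range(n_chunks):
            cols = df.columns[i * chunk_size:(i + 1) * chunk_size]
            # Each chunk gets its own key, so every table's column metadata
            # stays well below the 64kb object header limit.
            store.put('my_df_part{}'.format(i), df[cols], format='table')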

answered by BW0

As of 2014, the HDF documentation has been updated:

If you are using HDF5 1.8.0 or previous releases, there is a limit on the number of fields you can have in a compound datatype. This is due to the 64K limit on object header messages, into which datatypes are encoded. (However, you can create a lot of fields before it will fail. One user was able to create up to 1260 fields in a compound datatype before it failed.)
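
To see the same limit outside pandas, here is a rough h5py sketch (the field count and file name are arbitrary; whether it actually fails depends on how large the encoded compound datatype ends up):

import numpy as np
import h5py

# A compound dtype with many fields; once the encoded datatype no longer fits
# into the 64K object header, HDF5 refuses to create the dataset.
many_fields = np.dtype([('field_{}'.format(i), 'f8') for i in range(2000)])

with h5py.File('compound_limit.h5', 'w') as f:
    # Expected to fail with an "object header message is too large" error
    # on builds that are subject to the 64K limit.
    f.create_dataset('data', shape=(10,), dtype=many_fields)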

As for pandas, it can save a DataFrame with an arbitrary number of columns using the format='fixed' option; format='table' still raises the same error as in the question. I've also tried h5py and got the same 'too large header' error (even though I had a version > 1.8.0).
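
As a rough illustration of that difference (the file and column names are made up):

import numpy as np
import pandas as pd

# A wide frame similar to the one in the question: a few rows, thousands of columns.
wide = pd.DataFrame(np.random.randn(34, 2661),
                    columns=['col{}'.format(i) for i in range(2661)])

store = pd.HDFStore('wide.h5')
store.put('my_df_fixed', wide, format='fixed')    # works with thousands of columns
# store.put('my_df_table', wide, format='table')  # raises "object header message is too large"
store.close()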

answered by Alleo


Although this thread is more than 5 years old, the problem is still relevant. It's still not possible to save a DataFrame with more than 2000 columns as one table into an HDFStore. Using format='fixed' isn't an option if one wants to choose which columns to read from the HDFStore later.

Here is a function that splits the DataFrame into smaller ones and stores them as separate tables. Additionally, a pandas.Series is put into the HDFStore that records which table each column belongs to.

import numpy as np
import pandas as pd
from collections import ChainMap


def wideDf_to_hdf(filename, data, columns=None, maxColSize=2000, **kwargs):
    """Write a `pandas.DataFrame` with a large number of columns
    to one HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    data : pandas.DataFrame
        data to save in the HDFStore
    columns : list
        a list of columns for storing. If set to `None`, all
        columns are saved.
    maxColSize : int (default=2000)
        this number defines the maximum possible column size of
        a table in the HDFStore.

    """
    store = pd.HDFStore(filename, **kwargs)
    if columns is None:
        columns = data.columns
    colSize = len(columns)
    if colSize > maxColSize:
        # Split the columns into chunks of at most maxColSize columns each.
        numOfSplits = int(np.ceil(colSize / float(maxColSize)))
        colsSplit = [
            columns[i * maxColSize:(i + 1) * maxColSize]
            for i in range(numOfSplits)
        ]
        # Map every column name to the table ('data0', 'data1', ...) it lives in.
        _colsTabNum = ChainMap(*[
            dict(zip(cols, ['data{}'.format(num)] * len(cols)))
            for num, cols in enumerate(colsSplit)
        ])
        colsTabNum = pd.Series(dict(_colsTabNum)).sort_index()
        for num, cols in enumerate(colsSplit):
            store.put('data{}'.format(num), data[cols], format='table')
        # Store the column-to-table mapping alongside the data.
        store.put('colsTabNum', colsTabNum, format='fixed')
    else:
        store.put('data', data[columns], format='table')
    store.close()

DataFrames stored into an HDFStore with the function above can be read back with the following function.

def read_hdf_wideDf(filename, columns=None, **kwargs):
    """Read a `pandas.DataFrame` from an HDFStore.

    Parameters
    ----------
    filename : str
        name of the HDFStore
    columns : list
        the columns in this list are loaded. Load all columns,
        if set to `None`.

    Returns
    -------
    data : pandas.DataFrame
        loaded data.

    """
    store = pd.HDFStore(filename)
    data = []
    if 'colsTabNum' in store:
        # The mapping written by `wideDf_to_hdf`: column name -> table name.
        colsTabNum = store.select('colsTabNum')
        if columns is not None:
            # Group the requested columns by the table they were stored in.
            tabNums = colsTabNum[columns]
            for table in tabNums.unique():
                cols = list(tabNums[tabNums == table].index)
                data.append(store.select(table, columns=cols, **kwargs))
        else:
            for table in colsTabNum.unique():
                data.append(store.select(table, **kwargs))
        data = pd.concat(data, axis=1).sort_index(axis=1)
    else:
        data = store.select('data', columns=columns)
    store.close()
    return data
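
A quick usage sketch for the two functions above (the file name and column names are made up for illustration):

import numpy as np
import pandas as pd

# A wide DataFrame that would overflow the 64kb header limit as a single table.
df = pd.DataFrame(np.random.randn(34, 2661),
                  columns=['col{}'.format(i) for i in range(2661)])

wideDf_to_hdf('wide.h5', df, maxColSize=2000)

# Read everything back, or only a subset of columns.
full = read_hdf_wideDf('wide.h5')
subset = read_hdf_wideDf('wide.h5', columns=['col0', 'col5', 'col2500'])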
answered by Volker L.