
Appending to HDFStore fails with "cannot match existing table structure"

The final solution was to use the "converters" parameter of read_csv and check every value before adding it to the DataFrame. In the end there were only 2 broken values in over 80GB of raw data.

The parameter looks like this:

converters={'XXXXX': self.parse_xxxxx}

And the small static helper method looks like this:

@staticmethod
def parse_xxxxx(input):
    # Values are usually floats already; anything else is coerced,
    # with a fallback of 0.0 for unparseable entries.
    if not isinstance(input, float):
        try:
            return float(input)
        except ValueError:
            print "Broken Value: ", input
            return 0.0
    else:
        return input
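Wired into the read_csv call, the converter simply replaces the dtype entry for that column (a minimal sketch; the filename and chunk size are placeholders):

import pandas as pd

def parse_xxxxx(value):
    # Coerce anything that is not already a float; fall back to 0.0
    # for the handful of broken values in the raw data.
    if isinstance(value, float):
        return value
    try:
        return float(value)
    except ValueError:
        return 0.0

iterator_data = pd.read_csv('data.csv.bz2', sep=';|\t', compression='bz2',
                            index_col=False, header=None,
                            names=['XX', 'XXXX', 'Date', 'XXXXX'],
                            parse_dates=[2],
                            converters={'XXXXX': parse_xxxxx},
                            iterator=True, chunksize=100000)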

While trying to read roughly 40 GB of CSV data into an HDF file I ran into a confusing problem. After reading about 1 GB, the entire process fails with the following error:

File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write **kwargs)
  File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [Date] on appending data

The read_csv call I use is as follows:

pd.io.parsers.read_csv(filename, sep=";|\t", compression='bz2', index_col=False, header=None, names=['XX', 'XXXX', 'Date', 'XXXXX'], parse_dates=[2], date_parser=self.parse_date, low_memory=False, iterator=True, chunksize=self.input_chunksize, dtype={'Date': np.int64})

Why would the 'Date' column of the new chunk not fit the existing column when I explicitly set the dtype to int64?

Thanks for your help!

Here is the function for parsing the date:

@staticmethod
def parse_date(input_date):
    import datetime as dt
    import re

    # Fall back to a fixed sentinel timestamp for malformed values.
    if not re.match(r'\d{12}', input_date):
        input_date = '200101010101'

    timestamp = dt.datetime.strptime(input_date, '%Y%m%d%H%M')
    return timestamp
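For reference, the fallback behaves like this (example values made up):

parse_date('201405202105')  # -> datetime.datetime(2014, 5, 20, 21, 5)
parse_date('broken')        # -> datetime.datetime(2001, 1, 1, 1, 1), the sentinel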

After following some of Jeff's tips I can provide further details on my problem. Here is the entire code I use to load a bz2-encoded file:

iterator_data = pd.io.parsers.read_csv(filename, sep=";|\t", compression='bz2', index_col=False, header=None,
                                               names=['XX', 'XXXX', 'Date', 'XXXXX'], parse_dates=[2],
                                               date_parser=self.parse_date, iterator=True,
                                               chunksize=self.input_chunksize, dtype={'Date': np.int64})
for chunk in iterator_data:
    self.data_store.append('huge', chunk, data_columns=True)
    self.data_store.flush()

The csv file follows the following pattern: {STRING};{STRING};{STRING}\t{INT}

The output of ptdump -av run on the output file is the following:

ptdump -av datastore.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/huge (Group) ''
  /huge._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['XX', 'XXXX', 'Date', 'XXXXX'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['XX', 'XXXX', 'Date', 'XXXXX'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['XX', 'XXXX', 'Date', 'XXXXX']]
/huge/table (Table(167135401,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "XX": StringCol(itemsize=16, shape=(), dflt='', pos=1),
  "XXXX": StringCol(itemsize=16, shape=(), dflt='', pos=2),
  "Date": Int64Col(shape=(), dflt=0, pos=3),
  "XXXXX": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (2340,)
  autoIndex := True
  colindexes := {
    "Date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "XXXX": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "XXXXX": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "XX": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
  /huge/table._v_attrs (AttributeSet), 23 attributes:
   [XXXXX_dtype := 'int64',
    XXXXX_kind := ['XXXXX'],
    XX_dtype := 'string128',
    XX_kind := ['XX'],
    CLASS := 'TABLE',
    Date_dtype := 'datetime64',
    Date_kind := ['Date'],
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'XX',
    FIELD_2_FILL := '',
    FIELD_2_NAME := 'XXXX',
    FIELD_3_FILL := 0,
    FIELD_3_NAME := 'Date',
    FIELD_4_FILL := 0,
    FIELD_4_NAME := 'XXXXX',
    NROWS := 167135401,
    TITLE := '',
    XXXX_dtype := 'string128',
    XXXX_kind := ['XXXX'],
    VERSION := '2.6',
    index_kind := 'integer']

After a lot of additional debugging I got to the following error:

ValueError: invalid combinate of [values_axes] on appending data [name->XXXX,cname->XXXX,dtype->int64,shape->(1, 10)] vs current table [name->XXXX,cname->XXXX,dtype->string128,shape->None]

I then tried to fix this by modifying the read_csv call to force the proper type for the XXXX column, but just received the same error:

dtype={'XXXX': 's64', 'Date': dt.datetime})

Is read_csv ignoring the dtype settings, or what am I missing here?

When reading the data with a chunksize of 10 the last 2 chunk.info() calls give the following output:

Int64Index: 10 entries, 0 to 9
Data columns (total 4 columns):
XX         10  non-null values
XXXX       10  non-null values
Date       10  non-null values
XXXXX      10  non-null values
dtypes: datetime64[ns](1), int64(1), object(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 4 columns):
XX         10  non-null values
XXXX       10  non-null values
Date       10  non-null values
XXXXX      10  non-null values
dtypes: datetime64[ns](1), int64(2), object(1)
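One way to surface this kind of dtype drift before the append fails deep inside pytables is to compare every chunk's dtypes against those of the first chunk (a hypothetical debugging sketch, not part of the original code):

reference_dtypes = None
for i, chunk in enumerate(iterator_data):
    if reference_dtypes is None:
        reference_dtypes = chunk.dtypes
    elif not (chunk.dtypes == reference_dtypes).all():
        # Report the columns whose inferred dtype changed in this chunk
        print "dtype drift in chunk %d:" % i
        print chunk.dtypes[chunk.dtypes != reference_dtypes]
        break
    self.data_store.append('huge', chunk, data_columns=True)
    self.data_store.flush()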

I'm using pandas version 0.12.0.

asked May 20 '14 by FrozenSUSHI

1 Answer

OK, you have a couple of issues:

  • When specifying dtypes to pass to read_csv, they must be numpy dtypes; string dtypes are converted to object dtype, so the 's64' doesn't do anything. Neither does the datetime; that's what parse_dates is for.

  • Your dtypes in different chunks are DIFFERENT: in one you have 2 int64 columns and 1 object column, while the other has 1 int64 and 2 object columns. THIS is your problem. (I think the error message might be slightly confusing, which IIRC is fixed in later versions of pandas.)

So you need to conform your dtypes in EVERY chunk to be the same. You might have mixed data in that particular column. One way to fix this is to specify dtype={column_that_is_bad: 'object'}. Another is to call convert_objects(convert_numeric=True) ON THAT column to coerce all non-numeric values to nan (this will also change the dtype of the column to float64). Both options are sketched below.
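A rough sketch of both options (filename, chunk size, and the HDFStore setup are placeholders; the column name XXXX comes from the error message):

import pandas as pd

store = pd.HDFStore('datastore.h5', complevel=9, complib='blosc')

# Option 1: declare the mixed column as object up front, so every
# chunk parses it identically.
iterator_data = pd.read_csv('data.csv.bz2', sep=';|\t', compression='bz2',
                            index_col=False, header=None,
                            names=['XX', 'XXXX', 'Date', 'XXXXX'],
                            parse_dates=[2], iterator=True, chunksize=100000,
                            dtype={'XXXX': 'object'})

# Option 2: alternatively (or in addition), coerce non-numeric values to
# NaN in each chunk; the column then comes out float64 in every chunk.
for chunk in iterator_data:
    chunk['XXXX'] = chunk['XXXX'].convert_objects(convert_numeric=True)
    store.append('huge', chunk, data_columns=True)

store.close()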

answered by Jeff