The final solution was to use the "converters" parameter of read_csv and check every value before adding it to the DataFrame. In the end there were only 2 broken values in over 80GB of raw data.
The parameter looks like this:
converters={'XXXXX': self.parse_xxxxx}
And the small static helper method like this:
@staticmethod
def parse_xxxxx(input):
    # Coerce anything that isn't already a float; a broken value is
    # logged and replaced by 0.0 so it cannot change the column dtype.
    if not isinstance(input, float):
        try:
            return float(input)
        except ValueError:
            print "Broken Value: ", input
            return 0.0
    else:
        return input
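Wired into the read_csv call from the question below, the converter looks roughly like this (a sketch rather than the verbatim production code; the file name and chunk size are placeholders, and parse_date / parse_xxxxx are the helpers shown in this post):

import pandas as pd

iterator_data = pd.io.parsers.read_csv(
    'data.csv.bz2',                      # placeholder path
    sep=";|\t", compression='bz2', index_col=False, header=None,
    names=['XX', 'XXXX', 'Date', 'XXXXX'],
    parse_dates=[2], date_parser=parse_date,
    converters={'XXXXX': parse_xxxxx},   # validate every value up front
    iterator=True, chunksize=100000)     # placeholder chunk size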
While trying to read ca. 40 GB+ of CSV data into an HDF5 file I ran into a confusing problem. After reading about 1 GB the entire process fails with the following error:
File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write **kwargs)
File "/usr/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [Date] on appending data
The read_csv call I use is as follows:
pd.io.parsers.read_csv(filename, sep=";|\t", compression='bz2', index_col=False, header=None, names=['XX', 'XXXX', 'Date', 'XXXXX'], parse_dates=[2], date_parser=self.parse_date, low_memory=False, iterator=True, chunksize=self.input_chunksize, dtype={'Date': np.int64})
Why would the 'Date' column of the new chunk not fit the existing column when I explicitly set the dtype to int64?
Thanks for your help!
Here is the function for parsing the date:
@staticmethod
def parse_date(input_date):
    import datetime as dt
    import re
    # Anything that doesn't start with 12 digits is replaced by a
    # sentinel date before parsing.
    if not re.match(r'\d{12}', input_date):
        input_date = '200101010101'
    timestamp = dt.datetime.strptime(input_date, '%Y%m%d%H%M')
    return timestamp
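To illustrate the fallback: anything that doesn't start with twelve digits is replaced by the sentinel 200101010101 before parsing, so broken dates all map to the same timestamp (a quick interactive check, assuming the helper above):

>>> parse_date('200701011230')
datetime.datetime(2007, 1, 1, 12, 30)
>>> parse_date('garbage')  # falls back to the sentinel
datetime.datetime(2001, 1, 1, 1, 1)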
After following some of Jeff's tips I can provide further details on my problem. Here is the entire code I use to load a bz2-compressed file:
iterator_data = pd.io.parsers.read_csv(filename, sep=";|\t", compression='bz2', index_col=False, header=None,
                                       names=['XX', 'XXXX', 'Date', 'XXXXX'], parse_dates=[2],
                                       date_parser=self.parse_date, iterator=True,
                                       chunksize=self.input_chunksize, dtype={'Date': np.int64})

for chunk in iterator_data:
    self.data_store.append('huge', chunk, data_columns=True)
    self.data_store.flush()
The CSV file follows this pattern: {STRING};{STRING};{STRING}\t{INT}
The output of ptdump -av called for the output file is the following:
ptdump -av datastore.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.0',
TITLE := '',
VERSION := '1.0']
/huge (Group) ''
/huge._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['XX', 'XXXX', 'Date', 'XXXXX'],
encoding := None,
index_cols := [(0, 'index')],
info := {'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['XX', 'XXXX', 'Date', 'XXXXX'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['XX', 'XXXX', 'Date', 'XXXXX']]
/huge/table (Table(167135401,), shuffle, blosc(9)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"XX": StringCol(itemsize=16, shape=(), dflt='', pos=1),
"XXXX": StringCol(itemsize=16, shape=(), dflt='', pos=2),
"Date": Int64Col(shape=(), dflt=0, pos=3),
"XXXXX": Int64Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (2340,)
autoIndex := True
colindexes := {
"Date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"XXXX": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"XXXXX": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"XX": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
/huge/table._v_attrs (AttributeSet), 23 attributes:
[XXXXX_dtype := 'int64',
XXXXX_kind := ['XXXXX'],
XX_dtype := 'string128',
XX_kind := ['XX'],
CLASS := 'TABLE',
Date_dtype := 'datetime64',
Date_kind := ['Date'],
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'XX',
FIELD_2_FILL := '',
FIELD_2_NAME := 'XXXX',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'Date',
FIELD_4_FILL := 0,
FIELD_4_NAME := 'XXXXX',
NROWS := 167135401,
TITLE := '',
XXXX_dtype := 'string128',
XXXX_kind := ['XXXX'],
VERSION := '2.6',
index_kind := 'integer']
After a lot of additional debugging I got to the following error:
ValueError: invalid combinate of [values_axes] on appending data [name->XXXX,cname->XXXX,dtype->int64,shape->(1, 10)] vs current table [name->XXXX,cname->XXXX,dtype->string128,shape->None]
I then tried to fix this by modifying the read_csv call to force the proper type for the XXXX column, but I just received the same error:
dtype={'XXXX': 's64', 'Date': dt.datetime})
Is read_csv ignoring the dtype settings, or what am I missing here?
When reading the data with a chunksize of 10, the last two chunk.info() calls give the following output:
Int64Index: 10 entries, 0 to 9
Data columns (total 4 columns):
XX 10 non-null values
XXXX 10 non-null values
Date 10 non-null values
XXXXX 10 non-null values
dtypes: datetime64[ns](1), int64(1), object(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 4 columns):
XX 10 non-null values
XXXX 10 non-null values
Date 10 non-null values
XXXXX 10 non-null values
dtypes: datetime64[ns](1), int64(2), object(1)
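A quick way to surface this drift programmatically, instead of eyeballing info() output, is to compare each chunk's dtypes against the first chunk's (a hypothetical debugging snippet; it assumes a freshly created iterator_data):

reference_dtypes = None
for i, chunk in enumerate(iterator_data):
    if reference_dtypes is None:
        reference_dtypes = chunk.dtypes
    elif (chunk.dtypes != reference_dtypes).any():
        # Report which columns drifted and in which chunk.
        print "dtype drift in chunk %d:" % i
        print chunk.dtypes[chunk.dtypes != reference_dtypes]
        break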
I'm using pandas version 0.12.0.
OK, you have a couple of issues:

When specifying dtypes to pass to read_csv, they must be numpy dtypes; string dtypes are converted to object dtype (so the s64 doesn't do anything), and neither does datetime; that's what parse_dates is for.

Your dtypes in different chunks are DIFFERENT: one chunk has 1 int64 column and 2 object, while the next has 2 int64 and 1 object. THIS is your problem. (I think the error message might be slightly confusing, which IIRC is fixed in later versions of pandas.)

So you need to conform your dtypes in EVERY chunk to be the same. You might have mixed data in that particular column. One way to do this is to specify dtype={column_that_is_bad: 'object'}. Another is to use convert_objects(convert_numeric=True) ON THAT column to coerce all non-numeric values to nan (this will also change the dtype of the column to float64).
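A minimal sketch of the second fix applied inside the chunk loop ('XXXXX' stands in for whichever column carries the mixed data; convert_objects is the pandas 0.12-era API, replaced by pd.to_numeric in later versions):

for chunk in iterator_data:
    # Coerce non-numeric entries in the bad column to NaN; the column
    # then comes out float64 in every chunk, so the appends line up.
    chunk['XXXXX'] = chunk['XXXXX'].convert_objects(convert_numeric=True)
    self.data_store.append('huge', chunk, data_columns=True)

The first fix is even simpler: pass dtype={'XXXXX': 'object'} to read_csv, so the column is read as object in every chunk.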