
Appending Column to Frame of HDF File in Pandas

I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.

I am able to create a new HDF file and a new frame with the first column:

import pandas

hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)

But after that, I get a ValueError when trying to append a new column to the frame:

feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)

Stack trace and error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data

I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.

asked Dec 06 '13 by lstyls


People also ask

What is HDF in pandas?

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

What is HDFStore?

An HDF file stores metadata about the data it contains, so that any application can interpret the content and structure of the file. Pandas provides the HDFStore interface to read, write, append to, and select from an HDF file.
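
A minimal sketch of that interface, with illustrative file and key names:

import pandas as pd

# open (or create) an HDF5 file; names here are only illustrative
store = pd.HDFStore('example.h5', mode='w')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# format='table' makes the stored object appendable and queryable
store.put('df', df, format='table')
store.append('df', pd.DataFrame({'a': [4], 'b': [7.0]}))

# select reads the stored object back into memory
result = store.select('df')
store.close()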

What is key in to_hdf?

In pandas to_hdf, the 'key' parameter is the name of the object you are storing in the HDF5 file. You can store multiple objects (DataFrames) in a single HDF5 file.
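
For example, with illustrative file and key names:

import pandas as pd

features = pd.DataFrame({'x': [1, 2, 3]})
labels = pd.DataFrame({'y': [0, 1, 0]})

# each object gets its own key inside the same HDF5 file
features.to_hdf('train_data.h5', key='features', mode='w')
labels.to_hdf('train_data.h5', key='labels')

# read one object back by its key
features_again = pd.read_hdf('train_data.h5', key='features')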

What is HDF in Python?

HDF5 stands for Hierarchical Data Format version 5. It is an open-source file format that comes in handy for storing large amounts of data. As the name suggests, it stores data in a hierarchical structure within a single file.
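
Keys can be slash-separated paths, which groups related objects inside one file; a small sketch with illustrative paths:

import pandas as pd

store = pd.HDFStore('data.h5', mode='w')

# slash-separated keys create groups within the file
store.put('train/features', pd.DataFrame({'x': [1, 2]}))
store.put('train/labels', pd.DataFrame({'y': [0, 1]}))

print(store.keys())  # ['/train/features', '/train/labels']
store.close()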


1 Answer

The complete docs are here, and some cookbook strategies are here.

PyTables is row-oriented, so you can only append rows. Read the CSV chunk-by-chunk (all columns at once), then append each chunk as you go, something like this:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
# read the CSV in manageable chunks and append each one to the same table
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()

You must be a tad careful, as it is possible for the dtypes of the resulting frame, when read chunk-by-chunk, to differ: e.g. you have an integer-like column that doesn't have missing values until, say, the 2nd chunk. The first chunk would read that column as int64, while the second would read it as float64. You may need to force dtypes with the dtype keyword to read_csv, see here.
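
A sketch of pinning a dtype up front; the column name and dtype here are only assumptions for illustration:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')

# forcing the column to float64 keeps every chunk consistent even if
# later chunks contain missing values; 'srch_id' is just an example name
for chunk in pd.read_csv('file.csv', chunksize=50000,
                         dtype={'srch_id': 'float64'}):
    store.append('df', chunk)
store.close()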

Here is a similar question as well.
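
And since the data is stored in table format, individual columns can later be pulled back one at a time with select, which matches the column-wise loading the question has in mind; a minimal sketch with an illustrative column name:

import pandas as pd

store = pd.HDFStore('file.h5', mode='r')

# read just one feature column of the table back into memory;
# 'srch_id' is only an example name taken from the traceback
feature = store.select('df', columns=['srch_id'])
store.close()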

answered Oct 26 '22 by Jeff