
Appending Column to Frame of HDF File in Pandas

I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.

I am able to create a new HDF file and a new frame with the first column:

import pandas

hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)

But after that, I get a ValueError when trying to append a new column to the frame:

feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)

Stack trace and error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data

I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.

asked Dec 06 '13 by lstyls


People also ask

What is HDF in pandas?

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

What is HDFStore?

An HDF file stores metadata about the data it contains, so that any application can interpret the content and structure of the file. Pandas provides the HDFStore interface to read, write, append to, and select from an HDF file.
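
A minimal sketch of that interface, with illustrative file and key names:

import pandas as pd

# open (or create) an HDF5 file; names here are only illustrative
store = pd.HDFStore('example.h5', mode='w')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# format='table' makes the stored object appendable and queryable
store.put('df', df, format='table')
store.append('df', pd.DataFrame({'a': [4], 'b': [7.0]}))

# select reads the stored object back into memory
result = store.select('df')
store.close()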

What is key in to_hdf?

In pandas to_hdf, the 'key' parameter is the name of the object you are storing in the HDF5 file. You can store multiple objects (DataFrames) in a single HDF5 file.
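
For example, with illustrative file and key names:

import pandas as pd

features = pd.DataFrame({'x': [1, 2, 3]})
labels = pd.DataFrame({'y': [0, 1, 0]})

# each object gets its own key inside the same HDF5 file
features.to_hdf('train_data.h5', key='features', mode='w')
labels.to_hdf('train_data.h5', key='labels')

# read one object back by its key
features_again = pd.read_hdf('train_data.h5', key='features')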

What is HDF in Python?

HDF5 stands for Hierarchical Data Format version 5. It is an open-source file format that comes in handy for storing large amounts of data. As the name suggests, it stores data in a hierarchical structure within a single file.
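
Keys can be slash-separated paths, which groups related objects inside one file; a small sketch with illustrative paths:

import pandas as pd

store = pd.HDFStore('data.h5', mode='w')

# slash-separated keys create groups within the file
store.put('train/features', pd.DataFrame({'x': [1, 2]}))
store.put('train/labels', pd.DataFrame({'y': [0, 1]}))

print(store.keys())  # ['/train/features', '/train/labels']
store.close()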


1 Answer

The complete docs are here, and some cookbook strategies are here.

PyTables is row-oriented, so you can only append rows. Read the CSV chunk-by-chunk (all columns at once), then append each chunk as you go, something like this:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
# read the CSV in manageable chunks and append each one to the same table
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()

You must be a tad careful, as it is possible for the dtypes of the resulting frame, when read chunk-by-chunk, to differ: e.g. you have an integer-like column that doesn't have missing values until, say, the 2nd chunk. The first chunk would read that column as int64, while the second would read it as float64. You may need to force dtypes with the dtype keyword to read_csv, see here.
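
A sketch of pinning a dtype up front; the column name and dtype here are only assumptions for illustration:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')

# forcing the column to float64 keeps every chunk consistent even if
# later chunks contain missing values; 'srch_id' is just an example name
for chunk in pd.read_csv('file.csv', chunksize=50000,
                         dtype={'srch_id': 'float64'}):
    store.append('df', chunk)
store.close()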

Here is a similar question as well.
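
And since the data is stored in table format, individual columns can later be pulled back one at a time with select, which matches the column-wise loading the question has in mind; a minimal sketch with an illustrative column name:

import pandas as pd

store = pd.HDFStore('file.h5', mode='r')

# read just one feature column of the table back into memory;
# 'srch_id' is only an example name taken from the traceback
feature = store.select('df', columns=['srch_id'])
store.close()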

answered Oct 26 '22 by Jeff