
Iteratively writing to HDF5 Stores in Pandas

Pandas has the following examples for how to store Series, DataFrames and Panels in HDF5 files:

Prepare some data:

    In [1142]: store = HDFStore('store.h5')

    In [1143]: index = date_range('1/1/2000', periods=8)

    In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

    In [1145]: df = DataFrame(randn(8, 3), index=index,
       ......:                columns=['A', 'B', 'C'])
       ......:

    In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
       ......:            major_axis=date_range('1/1/2000', periods=5),
       ......:            minor_axis=['A', 'B', 'C', 'D'])
       ......:

Save it in a store:

    In [1147]: store['s'] = s

    In [1148]: store['df'] = df

    In [1149]: store['wp'] = wp

Inspect what's in the store:

    In [1150]: store
    Out[1150]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: store.h5
    /df            frame        (shape->[8,3])
    /s             series       (shape->[5])
    /wp            wide         (shape->[2,5,4])

Close the store:

    In [1151]: store.close()

Questions:

  1. In the code above, when is the data actually written to disk?

  2. Say I want to add thousands of large dataframes living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one since I cannot afford to have them all in memory at once as they would take too much memory. Is this possible with HDF5? What would be the correct way to do it?

  3. The Pandas documentation says the following:

    "These stores are not appendable once written (though you simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."

    What does it mean by not appendable nor queryable? Also, shouldn't it say once closed instead of written?

Amelio Vazquez-Reina asked May 19 '13 17:05





1 Answer

  1. As soon as the statement is executed, e.g. store['df'] = df. The close just closes the actual file (which will be closed for you if the process exits, but will print a warning message).

  2. Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format

    It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append and create a smaller number of nodes.

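    You can just iterate through your .csv files and store/append them one by one. Something like:

    import pandas as pd

    # files: an iterable of .csv paths
    for f in files:
        df = pd.read_csv(f)
        # the file name is used as the node key here
        df.to_hdf('file.h5', key=f)

    would be one way (creating a separate node for each file).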

  3. Not appendable means that once you write it, you cannot add rows to it (you can only remove it and rewrite it). Not queryable means you can only retrieve it all at once, e.g. you cannot select a sub-section.

    If you have a table, then you can do things like:

    pd.read_hdf('my_store.h5', 'a_table_node', where='index>100')

    which is like a database query, only getting part of the data

    Thus, a store is not appendable, nor queryable, while a table is both.
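
The append-and-query approach from points 2 and 3 could be sketched roughly as follows. This is only an illustration, not code from the pandas documentation: the helper name csvs_to_hdf, the node key 'data', the chunk size and the demo file names are assumptions. It also assumes that PyTables is installed and that every .csv file has the same columns and dtypes.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def csvs_to_hdf(csv_paths, h5_path, key="data", chunksize=50_000):
    """Append every CSV file to a single table-format node, chunk by chunk.

    `key` and `chunksize` are illustrative defaults, not recommendations.
    """
    with pd.HDFStore(h5_path, mode="w") as store:
        for path in csv_paths:
            # With chunksize set, read_csv yields DataFrames piece by piece,
            # so only one chunk needs to be in memory at a time.
            for chunk in pd.read_csv(path, chunksize=chunksize):
                # append() creates the node in table format on the first call
                # and adds rows on later calls. data_columns=True makes every
                # column usable in a `where` query.
                store.append(key, chunk, data_columns=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a few small demo CSV files (stand-ins for the real data).
        rng = np.random.default_rng(0)
        csv_paths = []
        for i in range(3):
            path = os.path.join(tmp, f"part{i}.csv")
            frame = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))
            frame.to_csv(path, index=False)
            csv_paths.append(path)

        h5_path = os.path.join(tmp, "combined.h5")
        csvs_to_hdf(csv_paths, h5_path, chunksize=40)

        # Query the table: only the rows matching the condition are read.
        subset = pd.read_hdf(h5_path, "data", where="A > 0")
        print(len(subset), "rows with A > 0")
```

This keeps everything in one node rather than one node per file. If the files contain string columns, append() may also need min_itemsize so that later, longer strings still fit the column width fixed by the first chunk.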

Jeff answered Sep 30 '22 06:09
