Pandas has the following examples for how to store Series, DataFrames, and Panels in HDF5 files:
In [1142]: store = HDFStore('store.h5')

In [1143]: index = date_range('1/1/2000', periods=8)

In [1144]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [1145]: df = DataFrame(randn(8, 3), index=index,
   ......:                columns=['A', 'B', 'C'])
   ......:

In [1146]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   ......:            major_axis=date_range('1/1/2000', periods=5),
   ......:            minor_axis=['A', 'B', 'C', 'D'])
   ......:

In [1147]: store['s'] = s

In [1148]: store['df'] = df

In [1149]: store['wp'] = wp

In [1150]: store
Out[1150]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df    frame   (shape->[8,3])
/s     series  (shape->[5])
/wp    wide    (shape->[2,5,4])

In [1151]: store.close()
In the code above, when is the data actually written to disk?
Say I want to add thousands of large DataFrames living in .csv files to a single .h5 file. I would need to load them and add them to the .h5 file one by one, since I cannot afford to hold them all in memory at once. Is this possible with HDF5? What would be the correct way to do it?
The Pandas documentation says the following:
"These stores are not appendable once written (though you simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety."
What does it mean by not appendable nor queryable? Also, shouldn't it say once closed instead of written?
Exporting a pandas DataFrame to an HDF5 file: the to_hdf() method exports a pandas DataFrame object to an HDF5 file. The HDF5 group under which the DataFrame is stored is specified through the key parameter.
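As a minimal sketch of that round trip (the file name "example.h5" and the key "my_data" are made up for illustration; writing requires the PyTables package to be installed):

```python
import pandas as pd

# A small stand-in DataFrame.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})

# Write the DataFrame under the HDF5 group /my_data.
df.to_hdf("example.h5", key="my_data", mode="w")

# Read it back by the same key.
df2 = pd.read_hdf("example.h5", key="my_data")
```

Reading back with the same key should return an identical DataFrame.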
Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format.
(a) Categorical features as strings: an interesting observation here is that HDF shows even slower loading speed than CSV, while the other binary formats perform noticeably better.
We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is essentially the same as the read_csv() function, but with tab ('\t') as the default delimiter instead of a comma.
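A quick sketch of the equivalence, using an in-memory string in place of a file on disk (the sample data is invented):

```python
import io
import pandas as pd

# A small tab-delimited sample standing in for a file on disk.
text = "A\tB\tC\n1\t2\t3\n4\t5\t6\n"

# read_table defaults to sep='\t'; read_csv(..., sep='\t') is equivalent.
df = pd.read_table(io.StringIO(text))
```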
The data is written to disk as soon as the statement is executed, e.g. store['df'] = df. The close() just closes the actual file (which will be closed for you when the process exits, but a warning message will be printed).
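To avoid relying on process exit, HDFStore can also be used as a context manager so the file is closed even if an error occurs. A sketch (file name and key are made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# HDFStore is a context manager: the file is closed on exit from
# the with-block, even if an exception is raised inside it.
with pd.HDFStore("store_cm.h5", mode="w") as store:
    store["df"] = df  # written to disk as the statement executes

# The handle is closed here; reopen to read.
out = pd.read_hdf("store_cm.h5", "df")
```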
Read the section http://pandas.pydata.org/pandas-docs/dev/io.html#storing-in-table-format
It is generally not a good idea to put a LOT of nodes in an .h5 file. You probably want to append instead and create a smaller number of nodes.
You can just iterate through your .csv files and store/append them one by one. Something like:
for f in files:
    df = pd.read_csv(f)
    df.to_hdf('file.h5', key=f)
This would be one way to do it (creating a separate node for each file).
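Following the advice above about keeping the number of nodes small, the files can instead be appended into a single table node, with only one CSV in memory at a time. A sketch with two throwaway CSVs standing in for the thousands of real ones (all file names and the key "data" are hypothetical):

```python
import glob
import pandas as pd

# Build a couple of small CSVs to stand in for the real files.
pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_csv("part1.csv", index=False)
pd.DataFrame({"A": [5, 6], "B": [7, 8]}).to_csv("part2.csv", index=False)

# Append every file into a single table node, one at a time.
# format='table' is what makes the node appendable.
for f in sorted(glob.glob("part*.csv")):
    df = pd.read_csv(f)
    df.to_hdf("combined.h5", key="data", mode="a",
              format="table", append=True)

combined = pd.read_hdf("combined.h5", "data")
```

Note that each chunk keeps its own RangeIndex, so the combined index contains duplicates; that is harmless for the table format but worth knowing if you later query by index.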
Not appendable: once you write it, you can only retrieve it all at once; you cannot select a sub-section.
If you have a table, then you can do things like:
pd.read_hdf('my_store.h5', 'a_table_node', where=['index>100'])
which is like a database query, only getting part of the data
Thus, a store is not appendable, nor queryable, while a table is both.
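The contrast can be sketched side by side (the store name, node names, and data are all invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"val": range(10)})

with pd.HDFStore("demo.h5", mode="w") as store:
    # Fixed format: fast, but write-once and all-or-nothing on read.
    store.put("fixed_node", df)
    # Table format: appendable and queryable.
    store.put("table_node", df, format="table")

    # Query only part of the table node, like a database WHERE clause.
    subset = store.select("table_node", where="index > 6")

    # Appending works on the table node.
    store.append("table_node", pd.DataFrame({"val": [10]}, index=[10]))

table_after = pd.read_hdf("demo.h5", "table_node")
```

Trying the same select or append on "fixed_node" would raise an error, which is exactly the fixed-format limitation the documentation describes.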