I have a question regarding how to filter results in the pd.read_hdf function. Here's the setup: I have a pandas DataFrame (with an np.datetime64 index) which I put into an HDF5 file. There's nothing fancy going on here, so no use of hierarchy or anything (maybe I could incorporate it?). Here's an example:
                     Foo  Bar
TIME
2014-07-14 12:02:00    0    0
2014-07-14 12:03:00    0    0
2014-07-14 12:04:00    0    0
2014-07-14 12:05:00    0    0
2014-07-14 12:06:00    0    0
2014-07-15 12:02:00    0    0
2014-07-15 12:03:00    0    0
2014-07-15 12:04:00    0    0
2014-07-15 12:05:00    0    0
2014-07-15 12:06:00    0    0
2014-07-16 12:02:00    0    0
2014-07-16 12:03:00    0    0
2014-07-16 12:04:00    0    0
2014-07-16 12:05:00    0    0
2014-07-16 12:06:00    0    0
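For concreteness, a frame like the one above can be built as follows (a sketch matching the sample data; all values are zero as shown):

```python
import datetime
import pandas as pd

# Three days (July 14-16), five one-minute rows each, all zeros
times = [datetime.datetime(2014, 7, day, 12, minute)
         for day in (14, 15, 16)
         for minute in range(2, 7)]
df = pd.DataFrame(0, index=pd.DatetimeIndex(times, name='TIME'),
                  columns=['Foo', 'Bar'])
print(df.shape)  # (15, 2)
```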
Now I store this in a .h5 file using the following commands:
store = pd.HDFStore('qux.h5')
#generate df
store.append('data', df)
store.close()
Next, I'll have another process which accesses this data, and I would like to take date/time slices of it. So suppose I want dates between 2014-07-14 and 2014-07-15, and only for times between 12:02:00 and 12:04:00. Currently I am using the following command to retrieve this:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715').between_time(start_time=datetime.time(12,2), end_time=datetime.time(12,4))
As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. So in other words:
This:
pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')
Is not the same as this:
pd.read_hdf('qux.h5', 'data')['20140714':'20140715']
While the end result is exactly the same, what's being done in the background is not. So my question is: is there a way to incorporate that time range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my HDF5 file? Maybe store a table for each day?
Thanks!
EDIT:
Regarding using hierarchy, I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again I may be mistaken here, but using my example of querying a date/time range, wouldn't I incur a performance penalty, since I'd need to read each table and then merge them to get a consolidated output?
You can select using a where mask. Here's an example:
In [50]: pd.set_option('max_rows',10)
In [51]: df = DataFrame(np.random.randn(1000,2),index=date_range('20130101',periods=1000,freq='H'))
In [52]: df
Out[52]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 11:00:00 0.554420 0.777484
2013-02-11 12:00:00 -0.558041 1.833465
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[1000 rows x 2 columns]
In [53]: store = pd.HDFStore('test.h5',mode='w')
In [54]: store.append('df',df)
In [55]: c = store.select_column('df','index')
In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')
In [57]: store.select('df',where=where)
Out[57]:
0 1
2013-01-01 00:00:00 -0.467844 1.038375
2013-01-01 01:00:00 0.057419 0.914379
2013-01-01 02:00:00 -1.378131 0.187081
2013-01-01 03:00:00 0.398765 -0.122692
2013-01-01 04:00:00 0.847332 0.967856
... ... ...
2013-02-11 03:00:00 0.902023 1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312 0.501893
2013-02-11 14:00:00 -0.280538 0.680498
2013-02-11 15:00:00 1.533521 -1.992070
[664 rows x 2 columns]
In [58]: store.close()
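The same select_column trick can also fold your date-range condition into the selection, instead of calling .between_time() after the read: turn both conditions into row coordinates and pass those as where. A sketch, using an in-memory minute-frequency index as a stand-in for what store.select_column('df', 'index') would return (the dates and times match your example):

```python
import numpy as np
import pandas as pd

# Stand-in for c = store.select_column('df', 'index'):
# a minute-frequency index covering 2014-07-14 .. 2014-07-16
c = pd.Series(pd.date_range('2014-07-14', periods=3 * 24 * 60, freq='min'))
ci = pd.DatetimeIndex(c)

# Date-range condition (what the original where string expressed)
date_ok = (ci >= '2014-07-14') & (ci <= '2014-07-15 23:59:59')

# Time-of-day condition: positions from indexer_between_time -> boolean mask
time_ok = np.zeros(len(ci), dtype=bool)
time_ok[ci.indexer_between_time('12:02', '12:04')] = True

# Integer coordinates of rows satisfying both conditions
where = np.nonzero(date_ok & time_ok)[0]
# store.select('df', where=where) would then read only these rows
```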
A couple of points to note. This reads in the entire index to start with. Usually this is not a burden; if it is, you can chunk-read it (provide start/stop, though it's a bit manual to do this ATM). I don't believe the current select_column can accept a query either.
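For reference, chunking the index read with start/stop might look like the sketch below. chunk_bounds is a hypothetical helper (not a pandas API), and store.get_storer('df').nrows is, as far as I recall, one way to obtain the total row count:

```python
def chunk_bounds(nrows, chunksize):
    """Yield (start, stop) row windows suitable for passing to
    store.select_column('df', 'index', start=start, stop=stop)."""
    for start in range(0, nrows, chunksize):
        yield start, min(start + chunksize, nrows)

# With e.g. nrows = store.get_storer('df').nrows:
windows = list(chunk_bounds(1000, 300))
# windows == [(0, 300), (300, 600), (600, 900), (900, 1000)]
```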
You could potentially iterate over the days (and do individual queries) if you have a gargantuan amount of data (tens of millions of rows, which are wide), which might be more efficient.
Recombining the data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag perf as well).
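If you do go the per-day route, the recombination step is just pd.concat over the sub-query results. A sketch, with small in-memory frames standing in for what each per-day pd.read_hdf(..., where=...) call would return:

```python
import pandas as pd

# Hypothetical per-day pieces; in real use each piece would come from a
# per-day pd.read_hdf('qux.h5', 'data', where=...) sub-query
pieces = []
for day in pd.date_range('2014-07-14', '2014-07-16', freq='D'):
    piece = pd.DataFrame({'Foo': 0, 'Bar': 0},
                         index=pd.date_range(day, periods=3, freq='min'))
    pieces.append(piece)

result = pd.concat(pieces)  # recombining sub-query results is cheap
```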