 

Pandas read_hdf query by date and time range

I have a question regarding how to filter results in the pd.read_hdf function. Here's the setup: I have a pandas DataFrame (with a np.datetime64 index) which I write to an HDF5 file. There's nothing fancy going on here, so no use of hierarchy or anything (though maybe I could incorporate it?). Here's an example:

                              Foo          Bar
TIME                                         
2014-07-14 12:02:00            0            0
2014-07-14 12:03:00            0            0
2014-07-14 12:04:00            0            0
2014-07-14 12:05:00            0            0
2014-07-14 12:06:00            0            0
2014-07-15 12:02:00            0            0
2014-07-15 12:03:00            0            0
2014-07-15 12:04:00            0            0
2014-07-15 12:05:00            0            0
2014-07-15 12:06:00            0            0
2014-07-16 12:02:00            0            0
2014-07-16 12:03:00            0            0
2014-07-16 12:04:00            0            0
2014-07-16 12:05:00            0            0
2014-07-16 12:06:00            0            0

Now I store this in a .h5 file using the following commands:

import pandas as pd

store = pd.HDFStore('qux.h5')
# generate df (see the sketch below)
store.append('data', df)
store.close()
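
For reference, a minimal sketch of what that elided generate-df step could look like, reproducing the zero-filled frame shown above:

import pandas as pd

# one row per minute, 12:02 through 12:06, on each of the three days
times = [pd.Timestamp('2014-07-%02d 12:%02d:00' % (day, minute))
         for day in (14, 15, 16) for minute in range(2, 7)]
df = pd.DataFrame({'Foo': 0, 'Bar': 0},
                  index=pd.DatetimeIndex(times, name='TIME'))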

Next, another process accesses this data, and I would like to take date/time slices of it. So suppose I want dates between 2014-07-14 and 2014-07-15, and only times between 12:02:00 and 12:04:00. Currently I'm using the following command to retrieve this:

pd.read_hdf('qux.h5', 'data',
            where='index >= 20140714 and index <= 20140715'
            ).between_time(start_time=datetime.time(12, 2),
                           end_time=datetime.time(12, 4))

As far as I'm aware (someone please correct me if I'm wrong here), the entire original dataset is not read into memory if I use 'where'. So in other words:

This:

pd.read_hdf('qux.h5', 'data', where='index >= 20140714 and index <= 20140715')

Is not the same as this:

pd.read_hdf('qux.h5', 'data')['20140714':'20140715']

While the end result is exactly the same, what's being done in the background is not. So my question is: is there a way to incorporate that time-range filter (i.e. .between_time()) into my where statement? Or is there another way I should structure my HDF5 file? Maybe store a table for each day?

Thanks!

EDIT:

Regarding using hierarchy: I'm aware that the structure should be highly dependent on how I'll be using the data. However, suppose I define a table per date (e.g. 'df/date_20140714', 'df/date_20140715', ...). Again, I may be mistaken here, but using my example of querying a date/time range, wouldn't I incur a performance penalty, since I'd need to read each table and then merge them to get a consolidated output?
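
For illustration, here's a sketch of what that consolidated read would involve with one table per day; the 'df/date_YYYYMMDD' keys are hypothetical, following the naming above:

import datetime
import pandas as pd

def read_days(path, start, end, start_time, end_time):
    # one select per day-table, then an in-memory merge of the pieces
    pieces = []
    with pd.HDFStore(path) as store:
        for day in pd.date_range(start, end, freq='D'):
            key = 'df/date_' + day.strftime('%Y%m%d')
            if key in store:
                pieces.append(store.select(key))
    return pd.concat(pieces).between_time(start_time, end_time)

# read_days('qux.h5', '2014-07-14', '2014-07-15',
#           datetime.time(12, 2), datetime.time(12, 4))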


1 Answer

See the pandas docs on selecting using a where mask. Here's an example:

In [49]: import numpy as np; import pandas as pd

In [50]: pd.set_option('max_rows', 10)

In [51]: df = pd.DataFrame(np.random.randn(1000, 2), index=pd.date_range('20130101', periods=1000, freq='H'))

In [52]: df
Out[52]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 11:00:00  0.554420  0.777484
2013-02-11 12:00:00 -0.558041  1.833465
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[1000 rows x 2 columns]

In [53]: store = pd.HDFStore('test.h5',mode='w')

In [54]: store.append('df',df)

In [55]: c = store.select_column('df','index')

In [56]: where = pd.DatetimeIndex(c).indexer_between_time('12:30','4:00')

In [57]: store.select('df',where=where)
Out[57]: 
                            0         1
2013-01-01 00:00:00 -0.467844  1.038375
2013-01-01 01:00:00  0.057419  0.914379
2013-01-01 02:00:00 -1.378131  0.187081
2013-01-01 03:00:00  0.398765 -0.122692
2013-01-01 04:00:00  0.847332  0.967856
...                       ...       ...
2013-02-11 03:00:00  0.902023  1.416775
2013-02-11 04:00:00 -1.455099 -0.766558
2013-02-11 13:00:00 -0.786312  0.501893
2013-02-11 14:00:00 -0.280538  0.680498
2013-02-11 15:00:00  1.533521 -1.992070

[664 rows x 2 columns]

In [58]: store.close()

A couple of points to note. This reads the entire index into memory to start with. Usually that's not a burden; if it is, you can chunk-read it (provide start/stop, though it's a bit manual to do this at the moment). I don't believe the current select_column can accept a query either.
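
If you want to fold the question's date range into the same on-disk select, one option (a sketch along the same lines, using the question's 'qux.h5'/'data' names) is to turn the time indexer into a boolean mask, combine it with a date mask, and pass the matching row coordinates to select:

import numpy as np
import pandas as pd

with pd.HDFStore('qux.h5') as store:
    # read just the index column (this could be chunked with start/stop)
    c = store.select_column('data', 'index')
    idx = pd.DatetimeIndex(c)

    # rows whose time-of-day falls in the window
    time_mask = np.zeros(len(idx), dtype=bool)
    time_mask[idx.indexer_between_time('12:02', '12:04')] = True

    # rows whose date falls in the range (inclusive of 2014-07-15)
    date_mask = (idx >= '2014-07-14') & (idx < '2014-07-16')

    # select reads only the rows at these coordinates from disk
    result = store.select('data', where=np.flatnonzero(time_mask & date_mask))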

You could potentially iterate over the days (doing individual queries) if you have a gargantuan amount of data (tens of millions of rows that are also wide); that might be more efficient.

Recombining the data is relatively cheap (via concat), so don't be afraid to sub-query (though doing this too much can drag performance as well).
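
As a sketch of that pattern against the question's single 'data' table (a day-by-day sub-query, filtering times in memory, then concat):

import datetime
import pandas as pd

pieces = []
for day in pd.date_range('2014-07-14', '2014-07-15', freq='D'):
    lo = day.strftime('%Y%m%d')
    hi = (day + pd.Timedelta(days=1)).strftime('%Y%m%d')
    # each select only touches one day's worth of rows on disk
    chunk = pd.read_hdf('qux.h5', 'data',
                        where='index >= %s and index < %s' % (lo, hi))
    pieces.append(chunk.between_time(datetime.time(12, 2),
                                     datetime.time(12, 4)))

result = pd.concat(pieces)  # recombining the pieces is cheap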
