
How to subset pandas time series by time of day

Tags: python, pandas

I am trying to subset a pandas time series that spans multiple days by time of day. E.g., I only want times between 12:00 and 13:00.

I know how to do this for a specific date, e.g.,

In [44]: type(test)
Out[44]: pandas.core.frame.DataFrame

In [23]: test
Out[23]:
                           col1
timestamp
2012-01-14 11:59:56+00:00     3
2012-01-14 11:59:57+00:00     3
2012-01-14 11:59:58+00:00     3
2012-01-14 11:59:59+00:00     3
2012-01-14 12:00:00+00:00     3
2012-01-14 12:00:01+00:00     3
2012-01-14 12:00:02+00:00     3

In [30]: test['2012-01-14 12:00:00' : '2012-01-14 13:00']
Out[30]:
                           col1
timestamp 
2012-01-14 12:00:00+00:00     3
2012-01-14 12:00:01+00:00     3
2012-01-14 12:00:02+00:00     3

But I have failed to do it for any date using test.index.hour or test.index.indexer_between_time() which were both suggested as answers to similar questions. I tried the following:

In [44]: type(test)
Out[44]: pandas.core.frame.DataFrame

In [34]: test[(test.index.hour >= 12) & (test.index.hour < 13)]
Out[34]:
Empty DataFrame
Columns: [col1]
Index: []

In [36]: import datetime as dt
In [37]: test.index.indexer_between_time(dt.time(12),dt.time(13))
Out[37]: array([], dtype=int64)

For the first approach, I have no idea what test.index.hour or test.index.minute are actually returning:

In [41]: test.index
Out[41]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-14 11:59:56, ..., 2012-01-14 12:00:02]
Length: 7, Freq: None, Timezone: tzlocal()

In [42]: test.index.hour
Out[42]: array([11, 23,  0,  0,  0,  0,  0], dtype=int32)

In [43]: test.index.minute
Out[43]: array([59, 50,  0,  0, 50, 50,  0], dtype=int32)

What are they returning? How can I do the desired subsetting? Ideally, how can I get both of the approaches above to work?

Edit: The problem turned out to be that the index was invalid, as evidenced by Timezone: tzlocal() above; tzlocal() should not be allowed as a timezone. When I changed my method of generating the index to pd.to_datetime(), following the final part of the accepted answer, everything worked as expected.

Rahul Savani asked Feb 07 '14 10:02


1 Answer

Assuming the index is a valid pandas DatetimeIndex, the following will work:

test.index.hour returns an array containing the hour of each row in your DataFrame. For example (with numpy imported as np and pandas as pd):

df = pd.DataFrame(np.random.randn(100000, 1), columns=['A'], index=pd.date_range('20130101', periods=100000, freq='min'))

df.index.year returns array([2013, 2013, 2013, ..., 2013, 2013, 2013])
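The hour-based mask from the question can be sketched the same way. This is only an illustration on a small hypothetical two-day index built with pd.date_range (the names idx, df_demo, and noon_rows are made up here, not the asker's data); it assumes the index is a proper DatetimeIndex:

```python
import pandas as pd

# Hypothetical sample: one row every 30 minutes over two days, UTC-aware.
idx = pd.date_range("2012-01-14", periods=96, freq="30min", tz="UTC")
df_demo = pd.DataFrame({"col1": range(96)}, index=idx)

# index.hour holds one integer (0-23) per row, so it can drive a boolean mask.
mask = (df_demo.index.hour >= 12) & (df_demo.index.hour < 13)
noon_rows = df_demo[mask]
print(noon_rows)
```

Because the mask uses hour < 13, a row stamped exactly 13:00:00 is excluded.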

To grab all rows where the time of day is between 12:00 and 13:00, use

df.between_time('12:00','13:00')
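As a rough sketch, here is between_time next to the indexer_between_time call the question tried, on a hypothetical two-day sample (idx, df_demo, by_time, and by_position are illustrative names). With default arguments both calls include the start and end times, so a row at exactly 13:00:00 is kept:

```python
import datetime as dt

import pandas as pd

# Hypothetical sample: one row every 30 minutes over two days, UTC-aware.
idx = pd.date_range("2012-01-14", periods=96, freq="30min", tz="UTC")
df_demo = pd.DataFrame({"col1": range(96)}, index=idx)

# Select by time of day directly on the DataFrame; the date is ignored.
by_time = df_demo.between_time("12:00", "13:00")

# indexer_between_time returns integer positions, which iloc can consume.
positions = df_demo.index.indexer_between_time(dt.time(12), dt.time(13))
by_position = df_demo.iloc[positions]

print(by_time)
```

Both selections should contain the same rows: 12:00, 12:30, and 13:00 on each of the two days.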

This will grab that time window on every day in the index, across days, months, and years. If the index is not a valid DatetimeIndex, convert it to one using pd.to_datetime().
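A minimal sketch of that conversion, assuming the index currently holds timestamp strings (the frame raw and its values are hypothetical; how the asker's index was originally built is not shown in the question):

```python
import pandas as pd

# Hypothetical frame whose index holds timestamp strings, not datetimes.
raw = pd.DataFrame(
    {"col1": [3, 3, 3]},
    index=[
        "2012-01-14 11:59:59+00:00",
        "2012-01-14 12:00:00+00:00",
        "2012-01-15 12:00:01+00:00",
    ],
)

# Parse the strings into a timezone-aware DatetimeIndex (UTC).
raw.index = pd.to_datetime(raw.index, utc=True)

# Time-of-day selection now works across both dates.
subset = raw.between_time("12:00", "13:00")
print(subset)
```

Passing utc=True is one choice for getting a consistent timezone; other data may call for a different one.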

David Hagan answered Oct 20 '22 05:10