How to create a group ID based on 5-minute intervals in a pandas time series?

I have a time series DataFrame df that looks like this (all of the timestamps fall on the same day, but across different hours):

                                id               val 
 time                    
2014-04-03 16:01:53             23              14389      
2014-04-03 16:01:54             28              14391             
2014-04-03 16:05:55             24              14393             
2014-04-03 16:06:25             23              14395             
2014-04-03 16:07:01             23              14395             
2014-04-03 16:10:09             23              14395             
2014-04-03 16:10:23             26              14397             
2014-04-03 16:10:57             26              14397             
2014-04-03 16:11:10             26              14397

I need to create a group for every 5 minutes, starting from 16:00:00. That is, every row in the range 16:00:00 to 16:05:00 gets the value 1 in a new column period, the next 5-minute window gets 2, and so on. (The number of rows in each group is irregular, so I can't simply cut the frame into equal-sized chunks.)

Eventually, the data should look like this:

                                id               val           period 
time            
2014-04-03 16:01:53             23              14389             1
2014-04-03 16:01:54             28              14391             1
2014-04-03 16:05:55             24              14393             2
2014-04-03 16:06:25             23              14395             2
2014-04-03 16:07:01             23              14395             2
2014-04-03 16:10:09             23              14395             3
2014-04-03 16:10:23             26              14397             3
2014-04-03 16:10:57             26              14397             3
2014-04-03 16:11:10             26              14397             3

The purpose is to perform a groupby operation, but the operation I need is not available through the pd.resample(how=' ') method. So I have to create a period column to identify each group and then call df.groupby('period').apply(myfunc).

Any help or comments are highly appreciated.

Thanks!

asked May 31 '14 03:05

user3576212

2 Answers

You can use the TimeGrouper function in a groupby/apply. With a TimeGrouper you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].mean()

time
2014-04-03 16:00:00    14390.000000
2014-04-03 16:05:00    14394.333333
2014-04-03 16:10:00    14396.500000

Or an example with an explicit apply:

>>> df.groupby(pd.TimeGrouper('5Min'))['val'].apply(lambda x: len(x) > 3)

time
2014-04-03 16:00:00    False
2014-04-03 16:05:00    False
2014-04-03 16:10:00     True
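
Note for newer pandas versions: pd.TimeGrouper was deprecated and later removed, and pd.Grouper(freq=...) takes its place. Here is a minimal sketch of the two examples above, assuming the sample frame from the question is rebuilt by hand (the names means and flags are only illustrative):

```python
import pandas as pd

# Rebuild the sample frame from the question (typed in by hand).
times = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=pd.Index(times, name="time"),
)

# pd.Grouper(freq=...) bins the DatetimeIndex into 5-minute windows.
means = df.groupby(pd.Grouper(freq="5min"))["val"].mean()

# Any function can go in apply; here, "does the bin hold more than 3 rows?"
flags = df.groupby(pd.Grouper(freq="5min"))["val"].apply(lambda x: len(x) > 3)

print(means)
print(flags)
```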

Docstring for TimeGrouper:

Docstring for resample:class TimeGrouper@21

TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)

Custom groupby class for time-interval grouping

Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
    If axis is PeriodIndex

Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object

Edit

I don't know of an elegant way to create the period column, but the following will work:

>>> new = df.groupby(pd.TimeGrouper('5Min'),as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df

                     id    val  period
time
2014-04-03 16:01:53  23  14389       0
2014-04-03 16:01:54  28  14391       0 
2014-04-03 16:05:55  24  14393       1
2014-04-03 16:06:25  23  14395       1
2014-04-03 16:07:01  23  14395       1
2014-04-03 16:10:09  23  14395       2
2014-04-03 16:10:23  26  14397       2
2014-04-03 16:10:57  26  14397       2
2014-04-03 16:11:10  26  14397       2

It works because the groupby here with as_index=False actually returns the period column you want as part of the MultiIndex; I just grab that level of the MultiIndex and assign it to a new column in the original dataframe. You could do anything in the apply; I just want the index:

>>> new

   time
0  2014-04-03 16:01:53    14389
   2014-04-03 16:01:54    14391
1  2014-04-03 16:05:55    14393
   2014-04-03 16:06:25    14395
   2014-04-03 16:07:01    14395
2  2014-04-03 16:10:09    14395
   2014-04-03 16:10:23    14397
   2014-04-03 16:10:57    14397
   2014-04-03 16:11:10    14397

>>> new.index.get_level_values(0)

Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
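
If the period number is all you need, integer arithmetic on the index may be a more direct route. This is a sketch of a different technique from the groupby trick above. It assumes the frame from the question and a hand-picked origin of 16:00:00 (the name start is hypothetical, not something pandas provides):

```python
import pandas as pd

# Rebuild the sample frame from the question (typed in by hand).
times = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {
        "id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
        "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397],
    },
    index=pd.Index(times, name="time"),
)

# Assumed origin of the first 5-minute window.
start = pd.Timestamp("2014-04-03 16:00:00")

# Elapsed time since the origin, floor-divided by the window length,
# gives a 0-based window number; + 1 makes it 1-based as in the question.
df["period"] = (df.index - start) // pd.Timedelta("5min") + 1

print(df)
```

With the period column in place, df.groupby('period').apply(myfunc) works as the question intended. Note that a window with no rows simply leaves a gap in the numbering.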

answered Oct 23 '22 09:10

Karl D.

If I understand the question correctly, then depending on what you're doing, this can be done a lot more easily using just the resample method.

# Get some data
import numpy as np
import pandas as pd

# pd.date_range builds the minute-by-minute index
index = pd.date_range(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
a = np.random.randint(20, high=30, size=(len(index), 1))
b = np.random.randint(14440, high=14449, size=(len(index), 1))
df = pd.DataFrame(np.concatenate((a, b), axis=1), index=index, columns=['id', 'val'])
df.head()


Out[34]:
                     id    val
2013-01-01 00:00:00  20  14446
2013-01-01 00:01:00  25  14443
2013-01-01 00:02:00  25  14448
2013-01-01 00:03:00  20  14445
2013-01-01 00:04:00  28  14442

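# Define a function for the sample variance (divides by n - 1)
import numpy as np

def pyfun(X):
    # The sample variance is undefined for fewer than two observations
    if X.shape[0] <= 1:
        result = np.nan
    else:
        # First pass: compute the mean
        total = 0
        for x in X:
            total = total + x
        mean = float(total) / X.shape[0]

        # Second pass: sum the squared deviations from the mean
        total = 0
        for x in X:
            total = total + (mean - x) ** 2
        result = float(total) / (X.shape[0] - 1)

    return result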

#Try it out
df.resample('5min', how=pyfun)


Out[53]:
                       id   val
2013-01-01 00:00:00  12.3   5.7
2013-01-01 00:05:00   9.3   7.3
2013-01-01 00:10:00   4.7   0.8
2013-01-01 00:15:00  10.8  10.3
2013-01-01 00:20:00  11.5   1.5

Well, that was easy. This works for your own functions, but if you want to use a function from a library, all you need to do is pass that function in the how keyword:

df.resample('5min', how=np.var).head()


Out[54]:
                       id   val
2013-01-01 00:00:00  12.3   5.7
2013-01-01 00:05:00   9.3   7.3
2013-01-01 00:10:00   4.7   0.8
2013-01-01 00:15:00  10.8  10.3
2013-01-01 00:20:00  11.5   1.5
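
In newer pandas versions the how keyword was deprecated and later removed from resample; the function is passed to .apply (or .agg) on the resampler object instead. Here is a rough sketch on a small random frame, where sample_var is a hypothetical vectorised stand-in for pyfun and the seed and frame size are arbitrary choices:

```python
import numpy as np
import pandas as pd

# Small random frame: 20 one-minute rows, so four 5-minute bins.
index = pd.date_range("2013-01-01 00:00", periods=20, freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "id": rng.integers(20, 30, size=len(index)),
        "val": rng.integers(14440, 14449, size=len(index)),
    },
    index=index,
)

def sample_var(x):
    """Sample variance (n - 1 in the denominator) of one column in one bin."""
    if len(x) <= 1:
        return np.nan
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)

# Custom function: called once per column per 5-minute bin.
custom = df.resample("5min").apply(sample_var)

# Built-in reduction: .var() also uses n - 1 by default.
builtin = df.resample("5min").var()

print(custom)
print(builtin)
```

Both results should agree, since the resampler's var() defaults to ddof=1.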

answered Oct 23 '22 10:10

pbreach
