
How to group a pandas dataframe by a defined time interval?

I have a DataFrame like this. I would like to group it into 60-minute intervals, starting at 06:30.

                         data
index
2017-02-14 06:29:57  11198648
2017-02-14 06:30:01  11198650
2017-02-14 06:37:22  11198706
2017-02-14 23:11:13  11207728
2017-02-14 23:21:43  11207774
2017-02-14 23:22:36  11207776

I am using:

df.groupby(pd.TimeGrouper(freq='60Min')) 

I get this grouping:

                      data
index
2017-02-14 06:00:00     x1
2017-02-14 07:00:00     x2
2017-02-14 08:00:00     x3
2017-02-14 09:00:00     x4
2017-02-14 10:00:00     x5

but I am looking for this result:

                      data
index
2017-02-14 06:30:00     x1
2017-02-14 07:30:00     x2
2017-02-14 08:30:00     x3
2017-02-14 09:30:00     x4
2017-02-14 10:30:00     x5

How can I tell the function to start grouping at 6:30 at one-hour intervals?

If it cannot be done with .groupby(pd.TimeGrouper(freq='60Min')), what is the best way to do it?

Regards, and thanks very much in advance.

EduardoRL asked Feb 15 '17 16:02



2 Answers

Use the base=30 and label='right' parameters of pd.Grouper.

Specifying label='right' labels each bin by its right edge, so the first bin is labelled 06:30 rather than 05:30. base defaults to 0, which puts the hourly bin edges on the hour; setting it to 30 shifts them to half past the hour.
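
To see what label actually changes, here is a minimal sketch on two of the timestamps from the question. It assumes a pandas version that accepts the offset keyword (1.1 or later), used here as a stand-in for base=30; the variable names are illustrative only.

```python
import pandas as pd

# Two timestamps from the question, one on each side of 06:30.
idx = pd.to_datetime(["2017-02-14 06:29:57", "2017-02-14 06:30:01"])
s = pd.Series([11198648, 11198650], index=idx)

# Both calls use the same bins: [05:30, 06:30) and [06:30, 07:30).
# Only the label attached to each bin differs.
left = s.resample("60min", offset="30min", label="left").first()
right = s.resample("60min", offset="30min", label="right").first()

print(left.index.tolist())   # bins labelled by their left edge
print(right.index.tolist())  # bins labelled by their right edge
```

Note that the 06:29:57 row ends up under the 06:30 label only because of label='right'; it still belongs to the bin that starts at 05:30.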

Suppose you want to take the first element of every group:

df.groupby(pd.Grouper(freq='60Min', base=30, label='right')).first()
# same thing using resample - df.resample('60Min', base=30, label='right').first()

yields:

                           data
index
2017-02-14 06:30:00  11198648.0
2017-02-14 07:30:00  11198650.0
2017-02-14 08:30:00         NaN
2017-02-14 09:30:00         NaN
2017-02-14 10:30:00         NaN
2017-02-14 11:30:00         NaN
2017-02-14 12:30:00         NaN
2017-02-14 13:30:00         NaN
2017-02-14 14:30:00         NaN
2017-02-14 15:30:00         NaN
2017-02-14 16:30:00         NaN
2017-02-14 17:30:00         NaN
2017-02-14 18:30:00         NaN
2017-02-14 19:30:00         NaN
2017-02-14 20:30:00         NaN
2017-02-14 21:30:00         NaN
2017-02-14 22:30:00         NaN
2017-02-14 23:30:00  11207728.0
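
A note for newer pandas versions: the base argument was deprecated in pandas 1.1 and removed in 2.0, so the call above is no longer accepted there. The sketch below rebuilds the sample frame from the question and uses offset='30min' instead, on the assumption that it is the appropriate replacement for base=30 with a 60-minute frequency. Treat it as an illustration, not as the answer's original code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {"data": [11198648, 11198650, 11198706, 11207728, 11207774, 11207776]},
    index=pd.to_datetime([
        "2017-02-14 06:29:57", "2017-02-14 06:30:01", "2017-02-14 06:37:22",
        "2017-02-14 23:11:13", "2017-02-14 23:21:43", "2017-02-14 23:22:36",
    ]),
)
df.index.name = "index"

# offset="30min" shifts the hourly bin edges to hh:30 (assumed stand-in for base=30).
by_grouper = df.groupby(pd.Grouper(freq="60min", offset="30min", label="right")).first()
by_resample = df.resample("60min", offset="30min", label="right").first()

# Show the first two bins; the remaining bins are empty until late evening.
print(by_grouper.head(2))
```

Both spellings should produce the same bins and labels, so which one to use is mostly a matter of taste.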
Nickil Maveli answered Sep 16 '22 17:09



You can use DataFrame.resample, a dedicated method for resampling time series, so you don't need DataFrame.groupby and pd.Grouper:

df.resample('60min', base=30, label='right').first() 

Output

                           data
index
2017-02-14 06:30:00  11198648.0
2017-02-14 07:30:00  11198650.0
2017-02-14 08:30:00         NaN
2017-02-14 09:30:00         NaN
2017-02-14 10:30:00         NaN
2017-02-14 11:30:00         NaN
2017-02-14 12:30:00         NaN
2017-02-14 13:30:00         NaN
2017-02-14 14:30:00         NaN
2017-02-14 15:30:00         NaN
2017-02-14 16:30:00         NaN
2017-02-14 17:30:00         NaN
2017-02-14 18:30:00         NaN
2017-02-14 19:30:00         NaN
2017-02-14 20:30:00         NaN
2017-02-14 21:30:00         NaN
2017-02-14 22:30:00         NaN
2017-02-14 23:30:00  11207728.0

Note: when your dataframe has multiple columns, select the column you want to aggregate; otherwise first() is applied to every column:

df.resample('60min', base=30, label='right')['data'].first() 
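
As a sketch of the multi-column case, the frame below adds a made-up second column, other, purely for illustration (it is not in the question). It again assumes offset='30min' in place of base=30, for pandas versions where base is no longer available.

```python
import pandas as pd

idx = pd.to_datetime(["2017-02-14 06:29:57", "2017-02-14 06:30:01", "2017-02-14 06:37:22"])
df = pd.DataFrame(
    {
        "data": [11198648, 11198650, 11198706],
        "other": ["a", "b", "c"],  # hypothetical extra column, not from the question
    },
    index=idx,
)

resampled = df.resample("60min", offset="30min", label="right")

only_data = resampled["data"].first()  # Series: one value per bin, 'data' only
all_columns = resampled.first()        # DataFrame: first() applied to each column

print(only_data)
print(all_columns)
```

Selecting the column before aggregating keeps the result a Series and avoids computing aggregates you do not need.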
Erfan answered Sep 20 '22 17:09
