I have this data:
data id url size domain subdomain
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/55.740002/37.610001/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3929 weather.com api.weather.com
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/54.720001/20.469999/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3845 weather.com api.weather.com
13/Jun/2016:06:27:27 3845 https://api.weather.com/v1/geocode/54.970001/73.370003/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 30055 weather.com api.weather.com
13/Jun/2016:06:27:27 30055 https://api.weather.com/v1/geocode/59.919998/30.219999/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3914 weather.com api.weather.com
13/Jun/2016:06:27:28 30055 https://facebook.com 4005 facebook.com facebook.com
I need to group it with a 5 minute interval. Desired output:
data id url size domain subdomain
13/Jun/2016:06:27:26 30055 https://api.weather.com/v1/geocode/55.740002/37.610001/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 3929 weather.com api.weather.com
13/Jun/2016:06:27:27 3845 https://api.weather.com/v1/geocode/54.970001/73.370003/aggregate.json?apiKey=e45ff1b7c7bda231216c7ab7c33509b8&products=conditionsshort,fcstdaily10short,fcsthourly24short,nowlinks 30055 weather.com api.weather.com
13/Jun/2016:06:27:28 30055 https://facebook.com 4005 facebook.com facebook.com
I need to group by id and subdomain
and establish a 5 minute interval.
I tried
print df.groupby([df['data'],pd.TimeGrouper(freq='Min')])
to group by minute first, but it returns TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'.
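For reference, here is a minimal sketch of loading the sample above into a DataFrame; the file name access.log and the whitespace-separated layout are assumptions:

import pandas as pd

# Assumes the log lines above are saved, header row included, in a
# whitespace-separated file (hypothetical name: access.log).
df = pd.read_csv('access.log', sep=r'\s+')
print(df.dtypes)  # 'data' is read as plain strings; the index is a RangeIndex, hence the TypeError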
You need to parse the data column using pd.to_datetime() with the appropriate format settings and use the result as the index. Then .groupby() while resampling to 5Min intervals:
# parse the log timestamp and use it as a DatetimeIndex
df.index = pd.to_datetime(df.data, format='%d/%b/%Y:%H:%M:%S')
# bucket into 5-minute intervals, then take the first row per (id, subdomain)
df.groupby(pd.TimeGrouper('5Min')).apply(lambda x: x.groupby(['id', 'subdomain']).first())
data \
data id subdomain
2016-06-13 06:25:00 3845 api.weather.com 13/Jun/2016:06:27:27
30055 api.weather.com 13/Jun/2016:06:27:26
facebook.com 13/Jun/2016:06:27:28
url \
data id subdomain
2016-06-13 06:25:00 3845 api.weather.com https://api.weather.com/v1/geocode/54.970001/7...
30055 api.weather.com https://api.weather.com/v1/geocode/55.740002/3...
facebook.com https://facebook.com
size domain
data id subdomain
2016-06-13 06:25:00 3845 api.weather.com 30055 weather.com
30055 api.weather.com 3929 weather.com
facebook.com 4005 facebook.com
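Note that pd.TimeGrouper was removed in recent pandas versions; pd.Grouper is its replacement, and with key='data' it can group on the column directly without setting the index first. A sketch of the same grouping with it, assuming the frame from the question:

import pandas as pd

# parse the timestamp column in place
df['data'] = pd.to_datetime(df['data'], format='%d/%b/%Y:%H:%M:%S')

# 5-minute bins on 'data', then one row per (id, subdomain) within each bin
out = (df.groupby([pd.Grouper(key='data', freq='5Min'), 'id', 'subdomain'])
         .first()
         .reset_index())
print(out)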
Note: to convert to datetime you can pass the following format:
df['data'] = pd.to_datetime(df['data'], format="%d/%b/%Y:%H:%M:%S")
Now you can use the groupby:
In [11]: df1 = df.set_index("data")
In [12]: df1.groupby(pd.TimeGrouper("5Min")).sum()
Out[12]:
id size
data
2016-06-13 06:25:00 124065 45748
This is better written as a resample:
In [13]: df1.resample("5Min").sum()
Out[13]:
id size
data
2016-06-13 06:25:00 124065 45748
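If you also want the per-id/subdomain split from the question while aggregating, the 5-minute resample can be combined with an ordinary groupby via pd.Grouper. A sketch, assuming pandas imported as pd and the df1 built above (with the parsed datetimes as its index):

# sum the transferred size per (id, subdomain) within each 5-minute bin
summary = df1.groupby([pd.Grouper(freq='5Min'), 'id', 'subdomain'])['size'].sum()
print(summary)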