I have a time series DataFrame df
that looks like this (the time series happens within the same day, but across different hours):
id val
time
2014-04-03 16:01:53 23 14389
2014-04-03 16:01:54 28 14391
2014-04-03 16:05:55 24 14393
2014-04-03 16:06:25 23 14395
2014-04-03 16:07:01 23 14395
2014-04-03 16:10:09 23 14395
2014-04-03 16:10:23 26 14397
2014-04-03 16:10:57 26 14397
2014-04-03 16:11:10 26 14397
I need to create groups of 5 minutes each, starting from 16:00:00. That is, all the rows within the range 16:00:00 to 16:05:00 get the value 1 in a new column period. (The number of rows within each group is irregular, so I can't simply cut the groups at fixed row counts.)
Eventually, the data should look like this:
id val period
time
2014-04-03 16:01:53 23 14389 1
2014-04-03 16:01:54 28 14391 1
2014-04-03 16:05:55 24 14393 2
2014-04-03 16:06:25 23 14395 2
2014-04-03 16:07:01 23 14395 2
2014-04-03 16:10:09 23 14395 3
2014-04-03 16:10:23 26 14397 3
2014-04-03 16:10:57 26 14397 3
2014-04-03 16:11:10 26 14397 3
The purpose is to perform some groupby operation, but the operation I need is not available through pd.resample(how=' '). So I have to create a period column to identify each group, then do df.groupby('period').apply(myfunc).
Any help or comments are highly appreciated.
Thanks!
You can use TimeGrouper
in a groupby/apply. With a TimeGrouper
you don't need to create your period column. I know you're not trying to compute the mean, but I will use it as an example:
>>> df.groupby(pd.TimeGrouper('5Min'))['val'].mean()
time
2014-04-03 16:00:00 14390.000000
2014-04-03 16:05:00 14394.333333
2014-04-03 16:10:00 14396.500000
Or an example with an explicit apply:
>>> df.groupby(pd.TimeGrouper('5Min'))['val'].apply(lambda x: len(x) > 3)
time
2014-04-03 16:00:00 False
2014-04-03 16:05:00 False
2014-04-03 16:10:00 True
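A note for readers on current pandas: TimeGrouper was deprecated and later removed; pd.Grouper(freq=...) is the replacement. A minimal sketch of the same mean computation, rebuilding the question's example data (the data values are copied from the question above):

```python
import pandas as pd

# Rebuild the example frame from the question
idx = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {"id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
     "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=idx,
)

# pd.Grouper is the modern spelling of pd.TimeGrouper
means = df.groupby(pd.Grouper(freq="5min"))["val"].mean()
```

This produces the same three 5-minute bins (16:00, 16:05, 16:10) as the TimeGrouper version above.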
Docstring for TimeGrouper:
TimeGrouper(self, freq = 'Min', closed = None, label = None,
how = 'mean', nperiods = None, axis = 0, fill_method = None,
limit = None, loffset = None, kind = None, convention = None, base = 0,
**kwargs)
Custom groupby class for time-interval grouping
Parameters
----------
freq : pandas date offset or offset alias for identifying bin edges
closed : closed end of interval; left or right
label : interval boundary to use for labeling; left or right
nperiods : optional, integer
convention : {'start', 'end', 'e', 's'}
If axis is PeriodIndex
Notes
-----
Use begin, end, nperiods to generate intervals that cannot be derived
directly from the associated object
Edit
I don't know of an elegant way to create the period column, but the following will work:
>>> new = df.groupby(pd.TimeGrouper('5Min'),as_index=False).apply(lambda x: x['val'])
>>> df['period'] = new.index.get_level_values(0)
>>> df
id val period
time
2014-04-03 16:01:53 23 14389 0
2014-04-03 16:01:54 28 14391 0
2014-04-03 16:05:55 24 14393 1
2014-04-03 16:06:25 23 14395 1
2014-04-03 16:07:01 23 14395 1
2014-04-03 16:10:09 23 14395 2
2014-04-03 16:10:23 26 14397 2
2014-04-03 16:10:57 26 14397 2
2014-04-03 16:11:10 26 14397 2
It works because the groupby here with as_index=False actually returns the period column you want as part of the multi-index, and I just grab that part of the multi-index and assign it to a new column in the original dataframe. You could do anything in the apply; I just want the index:
>>> new
time
0 2014-04-03 16:01:53 14389
2014-04-03 16:01:54 14391
1 2014-04-03 16:05:55 14393
2014-04-03 16:06:25 14395
2014-04-03 16:07:01 14395
2 2014-04-03 16:10:09 14395
2014-04-03 16:10:23 14397
2014-04-03 16:10:57 14397
2014-04-03 16:11:10 14397
>>> new.index.get_level_values(0)
Int64Index([0, 0, 1, 1, 1, 2, 2, 2, 2], dtype='int64')
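As an aside not in the original answer: modern pandas exposes GroupBy.ngroup(), which assigns each row its group number directly, so the apply/multi-index round-trip above isn't needed. A sketch on the question's data (values copied from the question; add 1 to match its 1-based periods):

```python
import pandas as pd

# Rebuild the question's example frame
idx = pd.to_datetime([
    "2014-04-03 16:01:53", "2014-04-03 16:01:54", "2014-04-03 16:05:55",
    "2014-04-03 16:06:25", "2014-04-03 16:07:01", "2014-04-03 16:10:09",
    "2014-04-03 16:10:23", "2014-04-03 16:10:57", "2014-04-03 16:11:10",
])
df = pd.DataFrame(
    {"id": [23, 28, 24, 23, 23, 23, 26, 26, 26],
     "val": [14389, 14391, 14393, 14395, 14395, 14395, 14397, 14397, 14397]},
    index=idx,
)

# ngroup() labels each row with its 0-based time-bin number;
# +1 matches the 1-based "period" column asked for in the question
df["period"] = df.groupby(pd.Grouper(freq="5min")).ngroup() + 1
```

This yields period values 1, 1, 2, 2, 2, 3, 3, 3, 3, exactly as requested.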
Depending on what you're doing, if I understand the question right, this can be done a lot more easily just by using the resample method:
#Get some data
index = pd.DatetimeIndex(start='2013-01-01 00:00', end='2013-01-31 00:00', freq='min')
a = np.random.randint(20, high=30, size=(len(index),1))
b = np.random.randint(14440, high=14449, size=(len(index),1))
df = pd.DataFrame(np.concatenate((a,b), axis=1), index=index, columns=['id','val'])
df.head()
Out[34]:
id val
2013-01-01 00:00:00 20 14446
2013-01-01 00:01:00 25 14443
2013-01-01 00:02:00 25 14448
2013-01-01 00:03:00 20 14445
2013-01-01 00:04:00 28 14442
#Define function for variance
import numpy as np
def pyfun(X):
    # sample variance, computed by hand
    if X.shape[0] <= 1:
        result = np.nan
    else:
        total = 0
        for x in X:
            total = total + x
        mean = float(total) / X.shape[0]
        total = 0
        for x in X:
            total = total + (mean - x)**2
        result = float(total) / (X.shape[0] - 1)
    return result
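As a quick sanity check (not in the original answer), pyfun should agree with numpy's sample variance, np.var with ddof=1. A condensed copy of the function, compared on a small slice of the question's values:

```python
import numpy as np

def pyfun(X):
    # condensed copy of the hand-rolled sample variance above
    if X.shape[0] <= 1:
        return np.nan
    mean = float(sum(X)) / X.shape[0]
    return float(sum((mean - x) ** 2 for x in X)) / (X.shape[0] - 1)

sample = np.array([14389, 14391, 14393, 14395, 14395])
print(pyfun(sample), np.var(sample, ddof=1))  # the two should agree
```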
#Try it out
df.resample('5min', how=pyfun)
Out[53]:
id val
2013-01-01 00:00:00 12.3 5.7
2013-01-01 00:05:00 9.3 7.3
2013-01-01 00:10:00 4.7 0.8
2013-01-01 00:15:00 10.8 10.3
2013-01-01 00:20:00 11.5 1.5
Well, that was easy. That was for your own functions, but if you want to use a function from a library, then all you need to do is specify the function in the how keyword:
df.resample('5min', how=np.var).head()
Out[54]:
id val
2013-01-01 00:00:00 12.3 5.7
2013-01-01 00:05:00 9.3 7.3
2013-01-01 00:10:00 4.7 0.8
2013-01-01 00:15:00 10.8 10.3
2013-01-01 00:20:00 11.5 1.5
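A note for readers on current pandas: the how= keyword was later removed from resample, so the examples above need the method-chaining spelling instead. A sketch with a smaller synthetic frame (column names and ranges follow the answer; the one-hour index is an assumption to keep it quick):

```python
import numpy as np
import pandas as pd

# Synthetic data, one row per minute for an hour
rng = np.random.default_rng(0)
index = pd.date_range("2013-01-01 00:00", "2013-01-01 01:00", freq="min")
df = pd.DataFrame(
    {"id": rng.integers(20, 30, size=len(index)),
     "val": rng.integers(14440, 14449, size=len(index))},
    index=index,
)

# Modern equivalents of df.resample('5min', how=...):
built_in = df.resample("5min").var()                          # built-in sample variance
custom = df.resample("5min").apply(lambda s: s.var(ddof=1))   # arbitrary callable
```

Both spellings aggregate each column per 5-minute bin; the built-in method is preferred when one exists, since it is much faster than a Python-level callable.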