Given the following pandas dataframe:
timestamp
0 2018-10-05 23:07:02
1 2018-10-05 23:07:13
2 2018-10-05 23:07:23
3 2018-10-05 23:07:36
4 2018-10-05 23:08:02
5 2018-10-05 23:09:16
6 2018-10-05 23:09:21
7 2018-10-05 23:09:39
8 2018-10-05 23:09:47
9 2018-10-05 23:10:01
10 2018-10-05 23:10:11
11 2018-10-05 23:10:23
12 2018-10-05 23:10:59
13 2018-10-05 23:11:03
14 2018-10-08 03:35:32
15 2018-10-08 03:35:58
16 2018-10-08 03:37:16
17 2018-10-08 03:38:04
18 2018-10-08 03:38:30
19 2018-10-08 03:38:36
20 2018-10-08 03:38:42
21 2018-10-08 03:38:52
22 2018-10-08 03:38:57
23 2018-10-08 03:39:10
24 2018-10-08 03:39:27
25 2018-10-08 03:40:47
26 2018-10-08 03:40:54
27 2018-10-08 03:41:02
28 2018-10-08 03:41:12
29 2018-10-08 03:41:32
How can I label each row with the ten-minute period it belongs to? For example:
timestamp 10min_period
0 2018-10-05 23:07:02 period_1
1 2018-10-05 23:07:13 period_1
2 2018-10-05 23:07:23 period_1
3 2018-10-05 23:07:36 period_1
4 2018-10-05 23:08:02 period_1
5 2018-10-05 23:09:16 period_1
6 2018-10-05 23:09:21 period_1
7 2018-10-05 23:09:39 period_1
8 2018-10-05 23:09:47 period_1
9 2018-10-05 23:10:01 period_1
10 2018-10-05 23:10:11 period_1
11 2018-10-05 23:10:23 period_1
12 2018-10-05 23:10:59 period_1
13 2018-10-05 23:11:03 period_1
14 2018-10-08 03:35:32 period_2
15 2018-10-08 03:35:58 period_2
16 2018-10-08 03:37:16 period_2
17 2018-10-08 03:38:04 period_2
18 2018-10-08 03:38:30 period_2
19 2018-10-08 03:38:36 period_2
20 2018-10-08 03:38:42 period_2
21 2018-10-08 03:38:52 period_2
22 2018-10-08 03:38:57 period_2
23 2018-10-08 03:39:10 period_2
24 2018-10-08 03:39:27 period_2
25 2018-10-08 03:40:47 period_2
26 2018-10-08 04:40:54 period_3
27 2018-10-08 04:41:02 period_3
28 2018-10-08 04:41:12 period_3
29 2018-10-08 04:41:32 period_3
As the expected output above shows, each period_n label covers a 10-minute span of time: when the datetime series exceeds the ten-minute threshold, a new label is created. I tried dt.floor('10Min'), but it does not work because it doesn't keep track of where to start and where to stop counting the 10-minute periods. I also tried:
a = df['timestamp'].offsets.DateOffset(minutes=10)
However, that does not work either. Any idea how to segment my DataFrame into periods of 10 minutes? This question is different from others because I am not specifying any particular start time. That is, I start counting from the first datetime row, and from there I count off the ten-minute periods.
UPDATE:
After converting the column to datetime, I also tried:
df['timestamp'].groupby(pd.TimeGrouper(freq='10Min'))
However, I got:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
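The TypeError above is raised because the grouper is applied to the index, which here is a RangeIndex rather than a DatetimeIndex. As a minimal sketch (the four-row frame is a hypothetical subset of the sample data, and the names counts and nonempty are illustrative), passing key= to pd.Grouper groups on the column instead. Note that the resulting bins are aligned to the clock (23:00, 23:10, and so on), not to the first timestamp:

```python
import pandas as pd

# Hypothetical four-row subset of the sample data.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2018-10-05 23:07:02", "2018-10-05 23:09:16",
    "2018-10-05 23:11:03", "2018-10-08 03:35:32",
])})

# key= tells the grouper to bin the 'timestamp' column, not the index.
counts = df.groupby(pd.Grouper(key="timestamp", freq="10min")).size()

# Keep only the bins that actually contain rows.
nonempty = counts[counts > 0]
print(nonempty)
```

The first non-empty bin starts at 23:00:00 even though the first row is 23:07:02, which is why clock-aligned grouping alone does not give periods that start at the first row.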
With a little vectorised arithmetic, this should be possible (and be performant):
import numpy as np
import pandas as pd

# Convert to datetime if not already.
# df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
u = (df.assign(timestamp=df['timestamp'].dt.floor('20min'))
       .groupby(pd.Grouper(key='timestamp', freq='10min'))
       .ngroup())
# Renumber the group ids consecutively and prefix them with 'period_'.
df['10min_period'] = np.char.add('period_', (pd.factorize(u)[0] + 1).astype(str))
Unfortunately, the drawback is that, although this produces the expected output for your sample data, there is no easy way to handle contiguous intervals of 10 minutes. pd.Grouper does not begin grouping from the first date in your column, so dt.floor('20min') is necessary as an initial step, and this will inadvertently lump some (or most) rows from "period_{i+1}" under "period_{i}".
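If vectorisation is not essential, one way to anchor each period at its own first row is a plain loop. This is only a sketch under one reading of the question: it assumes the timestamps are already sorted, and that a new period begins at the first row lying at least 10 minutes after the start of the current period. The helper name label_periods and the six-row frame are hypothetical:

```python
import pandas as pd

def label_periods(timestamps, width=pd.Timedelta(minutes=10)):
    """Label sorted timestamps; a new period starts at the first row that is
    at least `width` after the start of the current period."""
    labels, start, n = [], None, 0
    for t in timestamps:
        if start is None or t - start >= width:
            start, n = t, n + 1  # this row opens a new period
        labels.append(f"period_{n}")
    return labels

# Hypothetical six-row subset based on the expected output in the question.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2018-10-05 23:07:02", "2018-10-05 23:11:03",
    "2018-10-08 03:35:32", "2018-10-08 03:40:47",
    "2018-10-08 04:40:54", "2018-10-08 04:41:32",
])})
df["10min_period"] = label_periods(df["timestamp"])
print(df)
```

The loop is linear in the number of rows, but it runs in Python, so it will be slower than the grouper-based version on large frames.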
To reproduce your problem, I built a sample index (with datetime imported and pandas imported as pd):
index = pd.date_range(datetime.datetime.now().date() - datetime.timedelta(10), periods=100, freq='min')
This gives me the following DataFrame:
a = pd.DataFrame(index)
a
0
0 2018-10-28 00:00:00
1 2018-10-28 00:01:00
2 2018-10-28 00:02:00
3 2018-10-28 00:03:00
4 2018-10-28 00:04:00
5 2018-10-28 00:05:00
6 2018-10-28 00:06:00
7 2018-10-28 00:07:00
8 2018-10-28 00:08:00
9 2018-10-28 00:09:00
10 2018-10-28 00:10:00
...
[100 rows x 1 columns]
Then, I do this:
a['period'] = a.apply(lambda x: "period_%d" % (int(x[0].minute / 10) + 1), axis=1)
And I get this result:
0 period
0 2018-10-28 00:00:00 period_1
1 2018-10-28 00:01:00 period_1
2 2018-10-28 00:02:00 period_1
3 2018-10-28 00:03:00 period_1
4 2018-10-28 00:04:00 period_1
5 2018-10-28 00:05:00 period_1
6 2018-10-28 00:06:00 period_1
7 2018-10-28 00:07:00 period_1
8 2018-10-28 00:08:00 period_1
9 2018-10-28 00:09:00 period_1
10 2018-10-28 00:10:00 period_2
11 2018-10-28 00:11:00 period_2
12 2018-10-28 00:12:00 period_2
13 2018-10-28 00:13:00 period_2
14 2018-10-28 00:14:00 period_2
15 2018-10-28 00:15:00 period_2
...
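Note that the lambda above uses the minute of the hour, so the labels restart at 1 once the sample crosses 01:00:00. A possible variant, sketched here under the assumption that fixed 10-minute bins counted from the first row are acceptable, divides the elapsed time by the bin width. The column name timestamp and the fixed start date are choices made for this example only:

```python
import pandas as pd

# Same shape as the sample above: 100 rows, one per minute.
a = pd.DataFrame({"timestamp": pd.date_range("2018-10-28", periods=100, freq="min")})

# Time elapsed since the first row, floor-divided into 10-minute bins.
elapsed = a["timestamp"] - a["timestamp"].iloc[0]
bins = elapsed // pd.Timedelta(minutes=10)

a["period"] = "period_" + (bins + 1).astype(str)
print(a.iloc[[0, 9, 10, 60, 99]])
```

With this version, row 60 (01:00:00) is labelled period_7 instead of wrapping back to period_1. The bins stay fixed relative to the first row, so a long gap in the data skips label numbers rather than starting a fresh period.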
I hope this helps.