Is there any tool in Pandas or another Python package to facilitate labeling groups of a varying number of rows in a Pandas DataFrame? Here is more detail:
My DataFrame uses a column of datetime.datetime objects as its index. The timestamps are not evenly spaced, so the number of rows in a given amount of time varies.
In that DataFrame, I need to make a column that has a different label for each series of rows that spans a constant time window (12 seconds in my case).
For example, with times in seconds:
Time(s) | label
1 | 0
2 | 0
3 | 0
7 | 0
12 | 1
15 | 1
20 | 1
24 | 2
Etc…
I’m looking for an alternative to a brute-force loop. Related algorithms go by names like labeling, segmenting, and mapping, but I'm not sure exactly what to look for. I was hoping there would be something that uses Python, Pandas and/or NumPy more efficiently, since the number of rows is going to be on the order of a hundred million.
Thanks
Demo:
Sample DF:
In [32]: df = pd.DataFrame({'Time':pd.date_range('2017-01-01 00:00:01', freq='3S', periods=15)})
In [33]: df
Out[33]:
Time
0 2017-01-01 00:00:01
1 2017-01-01 00:00:04
2 2017-01-01 00:00:07
3 2017-01-01 00:00:10
4 2017-01-01 00:00:13
5 2017-01-01 00:00:16
6 2017-01-01 00:00:19
7 2017-01-01 00:00:22
8 2017-01-01 00:00:25
9 2017-01-01 00:00:28
10 2017-01-01 00:00:31
11 2017-01-01 00:00:34
12 2017-01-01 00:00:37
13 2017-01-01 00:00:40
14 2017-01-01 00:00:43
Solution:
In [34]: df['label'] = df.groupby(pd.Grouper(key='Time', freq='12S')).ngroup()
Result:
In [35]: df
Out[35]:
Time label
0 2017-01-01 00:00:01 0
1 2017-01-01 00:00:04 0
2 2017-01-01 00:00:07 0
3 2017-01-01 00:00:10 0
4 2017-01-01 00:00:13 1
5 2017-01-01 00:00:16 1
6 2017-01-01 00:00:19 1
7 2017-01-01 00:00:22 1
8 2017-01-01 00:00:25 2
9 2017-01-01 00:00:28 2
10 2017-01-01 00:00:31 2
11 2017-01-01 00:00:34 2
12 2017-01-01 00:00:37 3
13 2017-01-01 00:00:40 3
14 2017-01-01 00:00:43 3
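Since the question has the timestamps in the index and they are irregularly spaced, here is a minimal sketch of a loop-free alternative that uses integer division instead of groupby. It is not part of the answer above. It assumes the 12-second windows are counted from midnight of the first timestamp's day (the `origin` variable is an illustrative choice), and it reuses the times from the question's example.

```python
import numpy as np
import pandas as pd

# Irregularly spaced timestamps as the index, taken from the question's example.
seconds = [1, 2, 3, 7, 12, 15, 20, 24]
idx = pd.to_datetime("2017-01-01") + pd.to_timedelta(seconds, unit="s")
df = pd.DataFrame(index=idx)

# Assumption: windows are anchored at midnight of the first timestamp's day.
origin = df.index[0].normalize()
window = pd.Timedelta(seconds=12)

# Elapsed time since the origin, floor-divided by the window length,
# gives the window number for every row in one vectorized step.
elapsed = df.index - origin
df["label"] = np.asarray(elapsed // window)

print(df)
```

With this approach a window that contains no rows simply leaves a gap in the label sequence, and the work is a single pass over the index, which may matter at a hundred million rows.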