EDIT: Session generation from log file analysis with pandas seems to be exactly what I was looking for.
I have a dataframe that includes non-unique time stamps, and I'd like to group them by time windows. The basic logic would be -
1) Create a time range from each time stamp by adding n minutes before and after the time stamp.
2) Group by time ranges that overlap. The end effect here would be that a time window could be as small as a single time stamp +/- the time buffer, but there is no cap on how large a time window could be, as long as consecutive events are less than the time buffer apart.
It feels like df.groupby(pd.TimeGrouper(freq='%dMin' % n)) is the right answer, but I don't know how to have the TimeGrouper create dynamic time ranges when it sees events that are within a time buffer of each other.
For instance, if I try a TimeGrouper('20s') against a set of events: 10:34:00, 10:34:08, 10:34:08, 10:34:15, 10:34:28 and 10:34:54, then pandas will give me three groups (events falling between 10:34:00 - 10:34:20, 10:34:20 - 10:34:40, and 10:34:40 - 10:35:00). I would like to get just two groups back: 10:34:00 - 10:34:28, since no two consecutive events in that range are more than 20 seconds apart, and a second group containing only 10:34:54.
What is the best way to find temporal windows that are not static bins of time ranges?
Given a Series that looks something like -
time
0 2013-01-01 10:34:00+00:00
1 2013-01-01 10:34:12+00:00
2 2013-01-01 10:34:28+00:00
3 2013-01-01 10:34:54+00:00
4 2013-01-01 10:34:55+00:00
5 2013-01-01 10:35:19+00:00
6 2013-01-01 10:35:30+00:00
If I do a df.groupby(pd.TimeGrouper('20s')) on that Series, I would get back five groups, 10:34:00-:20, :20-:40, :40-10:35:00, etc. What I want is some function that creates elastic time ranges: as long as successive events are within 20 seconds of each other, keep expanding the time range. So I expect to get back -
2013-01-01 10:34:00 - 2013-01-01 10:34:48
0 2013-01-01 10:34:00+00:00
1 2013-01-01 10:34:12+00:00
2 2013-01-01 10:34:28+00:00
2013-01-01 10:34:54 - 2013-01-01 10:35:15
3 2013-01-01 10:34:54+00:00
4 2013-01-01 10:34:55+00:00
2013-01-01 10:35:19 - 2013-01-01 10:35:50
5 2013-01-01 10:35:19+00:00
6 2013-01-01 10:35:30+00:00
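For reference, here is a rough sketch of the gap logic I have in mind, written against a recent pandas (the 20-second threshold and timestamps match the example above; I don't know if this is the idiomatic way):

import pandas as pd

s = pd.Series(pd.to_datetime(
    ['2013-01-01 10:34:00', '2013-01-01 10:34:12', '2013-01-01 10:34:28',
     '2013-01-01 10:34:54', '2013-01-01 10:34:55', '2013-01-01 10:35:19',
     '2013-01-01 10:35:30'], utc=True))

# Start a new group whenever the gap to the previous event exceeds 20s,
# then label each row with the running count of such breaks.
group_ids = (s.diff() > pd.Timedelta(seconds=20)).cumsum()
for _, chunk in s.groupby(group_ids):
    print(chunk)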
Thanks.
Here is how to create a custom grouper. (This requires pandas >= 0.13 for the timedelta computations, but otherwise the approach works in other versions.)
Create your series
In [31]: s = pd.Series(range(8), pd.to_datetime(['20130101 10:34', '20130101 10:34:08', '20130101 10:34:08', '20130101 10:34:15', '20130101 10:34:28', '20130101 10:34:54', '20130101 10:34:55', '20130101 10:35:12']))
In [32]: s
Out[32]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 1
2013-01-01 10:34:08 2
2013-01-01 10:34:15 3
2013-01-01 10:34:28 4
2013-01-01 10:34:54 5
2013-01-01 10:34:55 6
2013-01-01 10:35:12 7
dtype: int64
This just computes the time difference in seconds between successive elements, but the grouping criterion could actually be anything:
In [33]: indexer = s.index.to_series().order().diff().fillna(0).astype('timedelta64[s]')
In [34]: indexer
Out[34]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 8
2013-01-01 10:34:08 0
2013-01-01 10:34:15 7
2013-01-01 10:34:28 13
2013-01-01 10:34:54 26
2013-01-01 10:34:55 1
2013-01-01 10:35:12 17
dtype: float64
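(As an aside: in later pandas versions Series.order() was removed in favor of sort_values(), and the seconds conversion can be done with .dt.total_seconds(). A sketch of the equivalent computation, assuming pandas >= 0.20:)

# Same diff-in-seconds, written for modern pandas.
indexer = (s.index.to_series()
            .sort_values()
            .diff()
            .fillna(pd.Timedelta(0))
            .dt.total_seconds())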
Arbitrarily assign things < 20s to group 0, and everything else to group 1. The rule could also be more elaborate: if the diff from the previous event is < 20s BUT the total diff (from the first event) is > 50s, put it in group 2.
In [35]: grouper = indexer.copy()
In [36]: grouper[indexer<20] = 0
In [37]: grouper[indexer>=20] = 1
In [95]: grouper[(indexer<20) & (indexer.cumsum()>50)] = 2
In [96]: grouper
Out[96]:
2013-01-01 10:34:00 0
2013-01-01 10:34:08 0
2013-01-01 10:34:08 0
2013-01-01 10:34:15 0
2013-01-01 10:34:28 0
2013-01-01 10:34:54 1
2013-01-01 10:34:55 2
2013-01-01 10:35:12 2
dtype: float64
Group them (you can also use an apply here):
In [97]: s.groupby(grouper).sum()
Out[97]:
0 10
1 5
2 13
dtype: int64
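For the plain 20-second gap rule (without the extra cumulative-total rule above), the whole grouping can also be collapsed into a single pass in modern pandas; a sketch, reusing the same s:

# Flag rows whose gap to the previous timestamp exceeds 20 seconds,
# then cumsum the flags so each elastic window gets its own id.
breaks = s.index.to_series().diff() > pd.Timedelta(seconds=20)
s.groupby(breaks.cumsum()).sum()
# Yields two groups here: rows 0-4 (all gaps <= 20s) and rows 5-7.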
You might want to consider using apply:
def my_grouper(datetime_value):
    return some_group(datetime_value)

df.groupby(df['date_time'].apply(my_grouper))
It's up to you to implement whatever grouping logic you need in your grouper function. By the way, merging overlapping time ranges is an inherently iterative task: for example, given A = (0, 10), B = (20, 30) and C = (10, 20), once C appears all three of A, B and C should be merged.
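As an illustration of that A/B/C point: if all the ranges are known up front, the merge can be done in one sweep after sorting (a generic sketch, separate from the incremental version below):

def merge_intervals(intervals):
    # Sort by start, then extend the last merged interval whenever
    # the next one overlaps or touches it.
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged

merge_intervals([(0, 10), (20, 30), (10, 20)])  # -> [(0, 30)]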
UPD:
This is my (admittedly ugly) version of the merging algorithm:
groups = {}
max_group_id = 1

def in_range(val, begin, end):
    return begin <= val <= end

def find_merged_group(begin, end):
    global max_group_id
    found_common_group = None
    full_wraps = []
    for (group_start, group_end), group in groups.items():
        begin_inclusion = in_range(begin, group_start, group_end)
        end_inclusion = in_range(end, group_start, group_end)
        full_inclusion = begin_inclusion and end_inclusion
        full_wrap = (not begin_inclusion and not end_inclusion and
                     in_range(group_start, begin, end) and
                     in_range(group_end, begin, end))
        if full_inclusion:
            groups[(begin, end)] = group
            return group
        if full_wrap:
            full_wraps.append(group)
        elif begin_inclusion or end_inclusion:
            if not found_common_group:
                found_common_group = group
            else:  # merge: relabel everything in this group into the common group
                for rng, g in groups.items():
                    if g == group:
                        groups[rng] = found_common_group
    if not found_common_group:
        found_common_group = max_group_id
        max_group_id += 1
    groups[(begin, end)] = found_common_group
    return found_common_group

def my_grouper(date_time):
    # +/- 1 assumes numeric time values (e.g. epoch seconds); use a
    # timedelta buffer instead if date_time is a Timestamp.
    return find_merged_group(date_time - 1, date_time + 1)

df['datetime'].apply(my_grouper)  # first run to fill the groups dict
grouped = df.groupby(df['datetime'].apply(my_grouper))  # this run uses the already merged groups