Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python/Pandas group close events together based on a window

I would like to group events that happened close together into a parent event.

An example is this:

import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1'],
        ['2020-01-01 10:01', '2'],
        ['2020-01-01 10:02', '3a'],
        ['2020-01-01 10:02', '3b'],
        ['2020-01-01 10:30', '4'],
        ['2020-01-01 10:50', '5'],
        ['2020-01-01 10:54', '6'],
        ['2020-01-01 10:55', '7'],
    ], columns=['event_time', 'event_id']
)
df['event_time'] = pd.to_datetime(df['event_time'])

In the above example with a window size of 1 minute what I want is for each event a list of all other events that occurred within +- 1 minute of it.

So something like:

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', ['2']],
        ['2020-01-01 10:01', '2', ['1','3a','3b']],
        ['2020-01-01 10:02', '3a', ['2','3b']],
        ['2020-01-01 10:02', '3b', ['3a', '2'],
        ['2020-01-01 10:30', '4', None],
        ['2020-01-01 10:50', '5', None],
        ['2020-01-01 10:54', '6', ['7']],
        ['2020-01-01 10:55', '7', ['6']],
    ], columns=['event_time', 'event_id', 'related_event_id_list']
)

I was almost able to get close with pandas merge_asof:

pd.merge_asof(df,df, on="event_time", tolerance=pd.Timedelta("1m"), allow_exact_matches=False, direction='nearest')

enter image description here

but it seems to just want to merge to one nearest event as opposed to some sort of option or way to merge all events within the tolerance for each row (and obviously just get more rows in my results so its more like an outer join on a range).

Was thinking grouping time-series events like this must be fairly common but could not really find any obvious functions in Pandas to do this or maybe I'm missing something.

As always trying to avoid writing the code myself via loops or anything like that if I can avoid it :)

like image 257
andrewm4894 Avatar asked Dec 09 '25 01:12

andrewm4894


1 Answers

One idea is filter DataFrame per rows with remove original row by DataFrame.drop:

td = pd.Timedelta("1m")
f = lambda x, y: df.loc[df['event_time'].between(y - td, y + td),
                        'event_id'].drop(x).tolist()
df['related_event_id_list'] = [f(k, v) for k, v in df['event_time'].items()]
print (df)
           event_time event_id related_event_id_list
0 2020-01-01 10:00:00        1                   [2]
1 2020-01-01 10:01:00        2           [1, 3a, 3b]
2 2020-01-01 10:02:00       3a               [2, 3b]
3 2020-01-01 10:02:00       3b               [2, 3a]
4 2020-01-01 10:30:00        4                    []
5 2020-01-01 10:50:00        5                    []
6 2020-01-01 10:54:00        6                   [7]
7 2020-01-01 10:55:00        7                   [6]
like image 54
jezrael Avatar answered Dec 11 '25 15:12

jezrael