I have a dataframe of the following form
import pandas as pd
Out[1]:
df = pd.DataFrame({'id':[1,2,3,4,5],
'group':['A','A','A','B','B'],
'start':['2012-08-19','2012-08-22','2013-08-19','2012-08-19','2013-08-19'],
'end':['2012-08-28','2013-09-13','2013-08-19','2012-12-19','2014-08-19']})
id group start end
0 1 A 2012-08-19 2012-08-28
1 2 A 2012-08-22 2013-09-13
2 3 A 2013-08-19 2013-08-21
3 4 B 2012-08-19 2012-12-19
4 5 B 2013-08-19 2014-08-19
For given row in my dataframe I'd like to count the number of items in the same group that have an overlapping time interval.
For example in group A id 2 ranges from 22 August 2012 to 13 Sept 2013 and hence the overlap between id 1 (19 August 2012 to 28 August 2012) and also id 3 (19 August 2013 to 21 August 2013) for a count of 2.
Conversely there is no overlap between the items in group B
So for my example dataframe above i'd like to produce something like
Out[2]:
id group start end count
0 1 A 2012-08-19 2012-08-28 1
1 2 A 2012-08-22 2013-09-13 2
2 3 A 2013-08-19 2013-08-21 1
3 4 B 2012-08-19 2012-12-19 0
4 5 B 2013-08-19 2014-08-19 0
I could "brute-force" this but I'd like to know if there is a more efficient Pandas way of getting this done.
Thanks in advance for your help
So, I would see how brute force fairs... if it's slow I'd cythonize this logic. It's not so bad, as whilst O(M^2) in group size, if there's lots of small groups it might not be so bad.
In [11]: def interval_overlaps(a, b):
...: return min(a["end"], b["end"]) - max(a["start"], b["start"]) > np.timedelta64(-1)
In [12]: def count_overlaps(df1):
...: return sum(interval_overlaps(df1.iloc[i], df1.iloc[j]) for i in range(len(df1) - 1) for j in range(i, len(df1)) if i < j)
In [13]: df.groupby("group").apply(count_overlaps)
Out[13]:
group
A 2
B 0
dtype: int64
The former is a tweaking of this interval overlap function.
Edit: Upon re-reading it looks like the count_overlaps is per-row, rather than per-group, so the agg function should be more like:
In [21]: def count_overlaps(df1):
...: return pd.Series([df1.apply(lambda x: interval_overlaps(x, df1.iloc[i]), axis=1).sum() - 1 for i in range(len(df1))], df1.index)
In [22]: df.groupby("group").apply(count_overlaps)
Out[22]:
group
A 0 1
1 2
2 1
B 3 0
4 0
dtype: int64
In [22]: df["count"] = df.groupby("group").apply(count_overlaps).values
In [23]: df
Out[23]:
end group id start count
0 2012-08-28 A 1 2012-08-19 1
1 2013-09-13 A 2 2012-08-22 2
2 2013-08-19 A 3 2013-08-19 1
3 2012-12-19 B 4 2012-08-19 0
4 2014-08-19 B 5 2013-08-19 0
"brute-force"ish but gets the job done:
First converted the date strings to dates and then compared each row against the df with an apply.
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)
df['count'] = df.apply(lambda row: len(df[ ( ( (row.start <= df.start) & (df.start <= row.end) ) \
| ( (df.start <= row.start) & (row.start <= df.end) ) )
& (row.id != df.id) & (row.group == df.group) ]),axis=1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With