I have a dataframe structured like this (but it actually has ~400k rows):
library(data.table)
df <- fread(" id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")
(edited, thanks to Uwe)
For each row, I want to count how many rows in the dataframe have the same id as the current row and a start-end period that overlaps the period in the current row. For example, for the first row, the result would be 2, since there is another row with id = 174095 and its end is greater than first row start.
I tried to do it with dplyr's rowwise, like:
df = df %>% rowwise() %>% mutate(count = sum(id == df$id & ((start >= df$start & start <= df$end) | (end >= df$start & end <= df$end))))
But this is extremely slow. I gave it a try and two hours later it was still running.
I also tried to use mapply, but it takes way too much time as well:
df$count = mapply(function(id, start, end) {
return(sum(df$id == id & (between(df$start, start, end) | between(df$end, start, end))) }, id, start, end)
Is there an efficient reasonable way to do this?
Thank you very much
EDIT 2019-03-06
@Uwe 's suggested solution:
df[, overlapping.rows := df[.SD, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]
works just fine for the sample data.frame above. But it turns out the sample wasn't illustrative enough, or I didn't really make myself understood maybe :)
I added a third record for id 174095 and modified the other two:
df <- fread("id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")
Now, id 174095 has two intervals that do not overlap between them (rows 1 and 2) and another interval that overlaps the other two (row 3):
id start end
1: 174095 2018-12-19 2018-12-31
2: 174095 2018-12-01 2018-12-18
3: 174095 2018-11-30 2018-12-25
So, the result should be:
id start end overlapping.rows
1: 174095 2018-12-19 2018-12-31 2
2: 174095 2018-12-01 2018-12-18 2
3: 174095 2018-11-30 2018-12-25 3
But it actually is:
id start end overlapping.rows
1: 174095 2018-12-19 2018-12-31 3
2: 174095 2018-12-01 2018-12-18 3
3: 174095 2018-11-30 2018-12-25 3
If I'm not mistaken, this is happening because the final join is done by "id" only, so all the rows with the same id have the same result.
My solution consists on performing the final merge also by "start" and "end":
df[tmp, on = .(id, start, end), overlapping.rows := N]
For some reason (I would love to find out...), on the self-join, start dates end up in the "end" column and vice-versa, so I had to add this line right after it:
setnames(tmp, c("id", "end", "start", "N"))
Now, the result is:
id start end overlapping.rows
1: 174095 2018-12-19 2018-12-31 2
2: 227156 2018-12-19 2018-12-31 1
3: 210610 2018-04-13 2018-09-27 1
4: 27677 2018-04-12 2018-04-26 2
5: 370474 2017-07-13 2017-08-19 1
6: 303693 2017-02-20 2017-04-09 1
7: 74744 2016-10-03 2016-11-05 1
8: 174095 2018-12-01 2018-12-18 2
9: 27677 2018-03-01 2018-05-29 2
10: 111111 2018-01-01 2018-01-31 1
11: 111111 2018-11-11 2018-12-31 1
12: 174095 2018-11-30 2018-12-25 3
Which is exactly what I expected!
Edit 2019-03-07 to cope with OP's expanded dataset
This can be solved by aggregating in a non-equi self-join
library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# non-equi self-join and aggregate
tmp <- df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]
# append counts to original dataset
df[, overlapping.rows := tmp$N]
df
id start end overlapping.rows 1: 174095 2018-12-19 2018-12-31 2 2: 227156 2018-12-19 2018-12-31 1 3: 210610 2018-04-13 2018-09-27 1 4: 27677 2018-04-12 2018-04-26 2 5: 370474 2017-07-13 2017-08-19 1 6: 303693 2017-02-20 2017-04-09 1 7: 74744 2016-10-03 2016-11-05 1 8: 174095 2018-12-01 2018-12-18 2 9: 27677 2018-03-01 2018-05-29 2 10: 111111 2018-01-01 2018-01-31 1 11: 111111 2018-11-11 2018-12-31 1 12: 174095 2018-11-30 2018-12-25 3
Using data.table chaining the code can be written in a more compact but also more convoluted way:
library(data.table)
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][
, overlapping.rows := df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]
Note that the part to append the results to the original df
is based on Frank's comment.
My original attempt to use a second join to append the results to the original df
failed in case there are different counts for the same id
as pointed out by the OP. This can be fixed by including the row number in the second join:
library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# append row number
tmp <- df[, rn := .I][
# non-equi self-join and aggregate
df, on = .(id, start <= end, end >= start), .(rn = i.rn, .N), by = .EACHI]
# append counts to original dataset by joining on row number
df[tmp, on = "rn", overlapping.rows := N][, rn := NULL]
df
id start end overlapping.rows 1: 174095 2018-12-19 2018-12-31 2 2: 227156 2018-12-19 2018-12-31 1 3: 210610 2018-04-13 2018-09-27 1 4: 27677 2018-04-12 2018-04-26 2 5: 370474 2017-07-13 2017-08-19 1 6: 303693 2017-02-20 2017-04-09 1 7: 74744 2016-10-03 2016-11-05 1 8: 174095 2018-12-01 2018-12-18 2 9: 27677 2018-03-01 2018-05-29 2 10: 111111 2018-01-01 2018-01-31 1 11: 111111 2018-11-11 2018-12-31 1 12: 174095 2018-11-30 2018-12-25 3
The join condition in the non-equi join does the trick. Two intervals do not overlap if the first one ends before the second one starts or the first interval starts after the second interval has ended,
e1 < s2 OR e2 < s1
Now, if two intervals do intersect/overlap then the opposite of the above must be true. By negating and applying De Morgan's law we get the conditions
s2 <= e1 AND e2 >= s1
which are used in the non-equi join.
OP's expanded dataset as described in OP's EDIT 2019-03-06:
library(data.table)
df <- fread("id start end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")
I've misunderstood the question initially, and I think @Uwe's approach is the way to go. In my first answer I've used data.table
to identify groups (and how many rows in a group) of consequent dates per id
, obviously not what you're after.
Here's also a short sqldf
snippet to complement @Uwe's approach (though not as adequate, as here the row order is not preserved - this would require some additional tinkering):
library(sqldf)
df <- sqldf('SELECT id, start, end, COUNT(*) as overlappingRows FROM (SELECT df.* FROM df
LEFT OUTER JOIN df AS df2
ON df.id = df2.id AND df.start <= df2.end AND df.end >= df2.start) as origdf
GROUP BY id, start, end')
Output:
id start end overlappingRows
1 27677 2018-03-01 2018-05-29 2
2 27677 2018-04-12 2018-04-26 2
3 74744 2016-10-03 2016-11-05 1
4 111111 2018-01-01 2018-01-31 1
5 111111 2018-11-11 2018-12-31 1
6 174095 2018-12-01 2018-12-20 2
7 174095 2018-12-19 2018-12-31 2
8 210610 2018-04-13 2018-09-27 1
9 227156 2018-12-19 2018-12-31 1
10 303693 2017-02-20 2017-04-09 1
11 370474 2017-07-13 2017-08-19 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With