Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count rows matching a criteria relative to current row

I have a dataframe structured like this (but it actually has ~400k rows):

library(data.table)
df <- fread("    id     start     end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
 27677 2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
 74744 2016-10-03 2016-11-05
174095 2018-12-01 2018-12-20
 27677 2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31")

(edited, thanks to Uwe)

For each row, I want to count how many rows in the dataframe have the same id as the current row and a start-end period that overlaps the period in the current row. For example, for the first row, the result would be 2, since there is another row with id = 174095 and its end is greater than first row start.

I tried to do it with dplyr's rowwise, like:

df = df %>% rowwise() %>% mutate(count = sum(id == df$id & ((start >= df$start & start <= df$end) | (end >= df$start & end <= df$end))))

But this is extremely slow. I gave it a try and two hours later it was still running.

I also tried to use mapply, but it takes way too much time as well:

df$count = mapply(function(id, start, end) {
return(sum(df$id == id & (between(df$start, start, end) | between(df$end, start, end))) }, id, start, end)

Is there an efficient reasonable way to do this?

Thank you very much


EDIT 2019-03-06

@Uwe 's suggested solution:

df[, overlapping.rows := df[.SD, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]

works just fine for the sample data.frame above. But it turns out the sample wasn't illustrative enough, or I didn't really make myself understood maybe :)

I added a third record for id 174095 and modified the other two:

df <- fread("id     start     end
174095 2018-12-19 2018-12-31
            227156 2018-12-19 2018-12-31
            210610 2018-04-13 2018-09-27
            27677 2018-04-12 2018-04-26
            370474 2017-07-13 2017-08-19
            303693 2017-02-20 2017-04-09
            74744 2016-10-03 2016-11-05
            174095 2018-12-01 2018-12-18
            27677 2018-03-01 2018-05-29
            111111 2018-01-01 2018-01-31
            111111 2018-11-11 2018-12-31
            174095 2018-11-30 2018-12-25")

Now, id 174095 has two intervals that do not overlap between them (rows 1 and 2) and another interval that overlaps the other two (row 3):

           id      start        end
1: 174095 2018-12-19 2018-12-31
2: 174095 2018-12-01 2018-12-18
3: 174095 2018-11-30 2018-12-25

So, the result should be:

       id      start        end overlapping.rows
1: 174095 2018-12-19 2018-12-31                2
2: 174095 2018-12-01 2018-12-18                2
3: 174095 2018-11-30 2018-12-25                3

But it actually is:

       id      start        end overlapping.rows
1: 174095 2018-12-19 2018-12-31                3
2: 174095 2018-12-01 2018-12-18                3
3: 174095 2018-11-30 2018-12-25                3

If I'm not mistaken, this is happening because the final join is done by "id" only, so all the rows with the same id have the same result.

My solution consists on performing the final merge also by "start" and "end":

df[tmp, on = .(id, start, end), overlapping.rows := N]

For some reason (I would love to find out...), on the self-join, start dates end up in the "end" column and vice-versa, so I had to add this line right after it:

setnames(tmp, c("id", "end", "start", "N"))

Now, the result is:

            id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Which is exactly what I expected!

like image 730
Diego Avatar asked Jan 02 '23 05:01

Diego


2 Answers

Edit 2019-03-07 to cope with OP's expanded dataset

This can be solved by aggregating in a non-equi self-join

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# non-equi self-join and aggregate
tmp <- df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]
# append counts to original dataset
df[, overlapping.rows := tmp$N]
df
        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Using data.table chaining the code can be written in a more compact but also more convoluted way:

library(data.table)
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][
  , overlapping.rows := df[df, on = .(id, start <= end, end >= start), .N, by = .EACHI]$N][]

Note that the part to append the results to the original df is based on Frank's comment.


My original attempt to use a second join to append the results to the original df failed in case there are different counts for the same id as pointed out by the OP. This can be fixed by including the row number in the second join:

library(data.table)
# coerce character dates to IDate class
cols <- c("start", "end")
setDT(df)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# append row number
tmp <- df[, rn := .I][
  # non-equi self-join and aggregate
  df, on = .(id, start <= end, end >= start), .(rn = i.rn, .N), by = .EACHI]
# append counts to original dataset by joining on row number
df[tmp, on = "rn", overlapping.rows := N][, rn := NULL]
df
        id      start        end overlapping.rows
 1: 174095 2018-12-19 2018-12-31                2
 2: 227156 2018-12-19 2018-12-31                1
 3: 210610 2018-04-13 2018-09-27                1
 4:  27677 2018-04-12 2018-04-26                2
 5: 370474 2017-07-13 2017-08-19                1
 6: 303693 2017-02-20 2017-04-09                1
 7:  74744 2016-10-03 2016-11-05                1
 8: 174095 2018-12-01 2018-12-18                2
 9:  27677 2018-03-01 2018-05-29                2
10: 111111 2018-01-01 2018-01-31                1
11: 111111 2018-11-11 2018-12-31                1
12: 174095 2018-11-30 2018-12-25                3

Explanation

The join condition in the non-equi join does the trick. Two intervals do not overlap if the first one ends before the second one starts or the first interval starts after the second interval has ended,

e1 < s2 OR e2 < s1

Now, if two intervals do intersect/overlap then the opposite of the above must be true. By negating and applying De Morgan's law we get the conditions

s2 <= e1 AND e2 >= s1

which are used in the non-equi join.

Data

OP's expanded dataset as described in OP's EDIT 2019-03-06:

library(data.table)
df <- fread("id     start     end
174095 2018-12-19 2018-12-31
227156 2018-12-19 2018-12-31
210610 2018-04-13 2018-09-27
27677  2018-04-12 2018-04-26
370474 2017-07-13 2017-08-19
303693 2017-02-20 2017-04-09
74744  2016-10-03 2016-11-05
174095 2018-12-01 2018-12-18
27677  2018-03-01 2018-05-29
111111 2018-01-01 2018-01-31
111111 2018-11-11 2018-12-31
174095 2018-11-30 2018-12-25")
like image 157
Uwe Avatar answered Jan 03 '23 17:01

Uwe


I've misunderstood the question initially, and I think @Uwe's approach is the way to go. In my first answer I've used data.table to identify groups (and how many rows in a group) of consequent dates per id, obviously not what you're after.

Here's also a short sqldf snippet to complement @Uwe's approach (though not as adequate, as here the row order is not preserved - this would require some additional tinkering):

library(sqldf)

df <- sqldf('SELECT id, start, end, COUNT(*) as overlappingRows FROM (SELECT df.* FROM df 
            LEFT OUTER JOIN df AS df2 
            ON df.id = df2.id AND df.start <= df2.end AND df.end >= df2.start) as origdf 
            GROUP BY id, start, end')

Output:

       id      start        end overlappingRows
1   27677 2018-03-01 2018-05-29               2
2   27677 2018-04-12 2018-04-26               2
3   74744 2016-10-03 2016-11-05               1
4  111111 2018-01-01 2018-01-31               1
5  111111 2018-11-11 2018-12-31               1
6  174095 2018-12-01 2018-12-20               2
7  174095 2018-12-19 2018-12-31               2
8  210610 2018-04-13 2018-09-27               1
9  227156 2018-12-19 2018-12-31               1
10 303693 2017-02-20 2017-04-09               1
11 370474 2017-07-13 2017-08-19               1
like image 29
arg0naut91 Avatar answered Jan 03 '23 17:01

arg0naut91