Edit (2019-06): This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.
I'm trying to find overlapping intervals and decided to join the interval data on itself with dplyr::left_join()
so that I could compare intervals with lubridate::int_overlaps()
to every other interval by the same id.
Here's how I expect left_join()
to behave. The two tibbles with three rows cross to form a tibble with 9 rows:
library(tidyverse)
tibble(a = rep("a", 3), b = rep(1, 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(2, 3)))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <dbl> <dbl>
1 a 1 2
2 a 1 2
3 a 1 2
4 a 1 2
5 a 1 2
6 a 1 2
7 a 1 2
8 a 1 2
9 a 1 2
And here's how the same code behaves with intervals. I get nine rows but the rows don't cross like they do above:
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
Joining, by = "a"
# A tibble: 9 x 3
a b c
<chr> <S4: Interval> <S4: Interval>
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
7 a NA--NA NA--NA
8 a NA--NA NA--NA
9 a NA--NA NA--NA
I think this is unexpected, but I might be missing something? Or is it a bug?
I'm using lubridate 1.7.1, tibble 1.3.4 and dplyr 0.7.4.
The bug
The object still contains the relevant information:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
print.data.frame(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res$c
# [1] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [5] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [9] 2002-01-01 UTC--2003-01-01 UTC
But when subsetting by indices it doesn't work anywmore :
res_df <- as.data.frame(res)
head(res_df)
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a NA--NA NA--NA
5 a NA--NA NA--NA
6 a NA--NA NA--NA
res_df[4,"c"]
[1] NA--NA
and tibble:::print.tbl
makes use of head
. That's why the issue is immediately visible with tibbles
and not with data.frames
.
Typing str(res$b)
we see that we only have 3 start
values for 9 data
values.
if we do:
res_df$b@start <- rep(res_df$b@start,3)
res_df$c@start <- rep(res_df$c@start,3)
eveything now print fine:
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
The Solution
We've seen that as.data.frame
is not enough, left_join
is the function messing things up, use merge
instead:
res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
merge(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))),
all.x=TRUE)
head(res)
# a b c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
res[4,"c"]
#[1] 2002-01-01 UTC--2003-01-01 UTC
I've reported the issue here
Looks like a bug in tibble()
:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
However:
> AA <- as.data.frame(AA)
class(AA$b)
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
a b
1 a 2001-01-01 UTC--2002-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC
Therefore, this works:
> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> BB <- tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))
> AA %>% as.data.frame %>% left_join(BB)
Joining, by = "a"
a b c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
although this does not:
> AA %>% left_join(BB)
Joining, by = "a"
Error in round_x - lhs :
Arithmetic operators undefined for 'Interval' and 'Interval' classes:
convert one to numeric or a matching time-span class.
Note: I'm using tibble_1.4.1 (same version of lubridate and dplyr as you), on R 3.4.3 for x86_64-pc-linux-gnu
This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.
library(lubridate)
library(tidyverse)
tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>%
left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
#> Joining, by = "a"
#> # A tibble: 9 x 3
#> a b c
#> <chr> <Interval> <Interval>
#> 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
Created on 2019-06-07 by the reprex package (v0.3.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With