left_join
works as expected with NA values on tibbles or data frames, but on tbl it seems it does not match NAs, even with the option na_matches = "na".
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
...
other attached packages:
[1] reprex_0.3.0 dbplyr_1.4.2 lubridate_1.7.4 magrittr_1.5 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2 readr_1.3.1
[10] tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1
...
The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB).
(1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl.
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")
df_1
#> # A tibble: 2 x 3
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
df_2
#> # A tibble: 2 x 3
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
dt_1
#> # Source: table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
dt_2
#> # Source: table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
(2) Then I use left_join
, first on the data frames, then on the tbls:
left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
We can see that the second row last column C
has the expected cc
in the case of data frames (by default na_matches = "na"
) but <NA>
in the case of tbl even with the explicit option na_matches = "na"
(which is the default according to the doc). This is unexpected.
Notice that this is the same result as in the case of data frames with na_matches = "never"
:
left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
BTW, the title mentions left_join
because it's the most common join, but same problem arises with inner_join
(full_join
is not yet out for data tables), perhaps even more visibly if we keep the na_matches = "na"
in both:
inner_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
inner_join(df_1, df_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
full_join() return all rows and all columns from both x and y . Where there are not matching values, returns NA for the one missing. return all rows from x where there are matching values in y , keeping just columns from x .
Joins with dplyr. The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge() ) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y) .
dplyr package provides several functions to join data frames in R. In R, Inner join or natural join is the default join and it's mostly used joining data frames, it is used to join data. frames on a specified single or multiple columns, and where column values don't match the rows get dropped from both data.
To respond to @philipxy ’s request to dig a little further in the left_join process I went on debug mode for left_join
, first on data tables:
debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#> debugging in: left_join(dt_1, dt_2, na_matches = "na")
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
#> Browse[2]> n
#> debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> debug: {
#> add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> }
Browse[3]>
#> debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#> debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> debug: {
#> if (!is.null(sql_on)) {
#> by <- list(x = character(0), y = character(0), on = sql(sql_on))
#> }
#> else if (identical(type, "full") && identical(by, character())) {
#> type <- "cross"
#> by <- list(x = character(0), y = character(0))
#> }
#> else {
#> by <- common_by(by, x, y)
#> }
#> y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#> list(by$y))
#> vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#> suffix = suffix)
#> x$ops <- op_double("join", x, y, args = list(vars = vars,
#> type = type, by = by, suffix = suffix))
#> x
#> }
Browse[4]> f
#> Joining, by = c("A", "D")
#> exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> exiting from: left_join(dt_1, dt_2, na_matches = "na")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA NA
We see that left_join
calls left_join.tbl_lazy
on data tables with the na_matches = “na”
option.
However this is followed by a call to add_op_join
the definition of which does not have any mention of na_matches
.
Then, by contrast, on data frames:
left_join(df_1, df_2)
#> debugging in: left_join(df_1, df_2)
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_df(df_1, df_2)
#> debug: {
#> check_valid_names(tbl_vars(x))
#> check_valid_names(tbl_vars(y))
#> by <- common_by(by, x, y)
#> suffix <- check_suffix(suffix)
#> na_matches <- check_na_matches(na_matches)
#> y <- auto_copy(x, y, copy = copy)
#> vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#> by_x <- vars$idx$x$by
#> by_y <- vars$idx$y$by
#> aux_x <- vars$idx$x$aux
#> aux_y <- vars$idx$y$aux
#> out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#> environment())
#> names(out) <- vars$alias
#> reconstruct_join(out, x, vars)
#> }
Browse[3]>
#> debug: check_valid_names(tbl_vars(x))
Browse[3]>
#> debug: check_valid_names(tbl_vars(y))
Browse[3]>
#> debug: by <- common_by(by, x, y)
Browse[3]>
#> Joining, by = c("A", "D")
#> debug: suffix <- check_suffix(suffix)
Browse[3]>
#> debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#> debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#> [1] TRUE
Browse[3]> f
#> exiting from: left_join.tbl_df(df_1, df_2)
#> exiting from: left_join(df_1, df_2)
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA cc
We see that left_join
calls left_join.tbl_df
on data frames. Further down we see that na_matches
is set to TRUE
before being used as argument in left_join_impl
. All this makes sense.
When typing ?left_join.tbl_lazy
the doc returns a local page for join.tbl_sql {dbplyr}
which states for the unspecified arguments (the …
):
“Other parameters passed onto methods, for instance, na_matches to control how NA values are matched. See join.tbl_df
for more”.
Following the join.tbl_df
doc link, it clearly mentions na_matches
:
"Use 'never' to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, 'na', always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config('dplyr::na_matches' = 'never')".
So there seems to be some inconsistency between the doc and the code for data tables.
Also, @philipxy mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames". Now the class of dt_1 and df_1 are:
class(df_1)
#> [1] "tbl_df" "tbl" "data.frame"
class(dt_1)
#> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
#> [4] "tbl_lazy" "tbl"
I suppose the term "data frame" refers to classes data.frame
and tbl_df
, and what I called "data table" are the other tbl_*
s including tbl_sql
and tbl_lazy
. So this news link also answers the question.
Still, I think the current documentation for the join verbs is confusing. It should clearly state:
"The defaults is na_matches = 'na'
for data frames and na_matches = 'never'
(with no other choice) for data tables".
Hopefully, the choice na_matches = "na"
will be implemented not too long in the future for data tables.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With