Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

left_join for tbl: na_matches not working

left_join works as expected with NA values on tibbles or data frames, but on tbl it seems it does not match NAs, even with the option na_matches = "na".

R version and package versions

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
...
other attached packages:
 [1] reprex_0.3.0    dbplyr_1.4.2    lubridate_1.7.4 magrittr_1.5    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2     readr_1.3.1
[10] tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1
...

The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB).

Minimal reprex.

(1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl.

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")

df_1
#> # A tibble: 2 x 3
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

df_2
#> # A tibble: 2 x 3
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

dt_1
#> # Source:   table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

dt_2
#> # Source:   table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

(2) Then I use left_join, first on the data frames, then on the tbls:

left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc

left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na") but <NA> in the case of tbl even with the explicit option na_matches = "na" (which is the default according to the doc). This is unexpected.

EDIT

Notice that this is the same result as in the case of data frames with na_matches = "never":


left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

BTW, the title mentions left_join because it's the most common join, but same problem arises with inner_join (full_join is not yet out for data tables), perhaps even more visibly if we keep the na_matches = "na" in both:

inner_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c
inner_join(df_1, df_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc
like image 405
Habert Avatar asked Aug 31 '19 02:08

Habert


People also ask

How does Full_join work in R?

full_join() return all rows and all columns from both x and y . Where there are not matching values, returns NA for the one missing. return all rows from x where there are matching values in y , keeping just columns from x .

How do I join tables in Dplyr?

Joins with dplyr. The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge() ) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y) .

What is inner join Dplyr?

dplyr package provides several functions to join data frames in R. In R, Inner join or natural join is the default join and it's mostly used joining data frames, it is used to join data. frames on a specified single or multiple columns, and where column values don't match the rows get dropped from both data.


1 Answers

To respond to @philipxy ’s request to dig a little further in the left_join process I went on debug mode for left_join, first on data tables:

debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#>  debugging in: left_join(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]>  n
#>  debug: UseMethod("left_join")
#>  Browse[2]> n
#>  debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>          suffix = suffix, auto_index = auto_index, ...)
#>  }
Browse[3]>
#>  debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#>  debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  debug: {
#>      if (!is.null(sql_on)) {
#>         by <- list(x = character(0), y = character(0), on = sql(sql_on))
#>      }
#>      else if (identical(type, "full") && identical(by, character())) {
#>          type <- "cross"
#>          by <- list(x = character(0), y = character(0))
#>      }
#>      else {
#>          by <- common_by(by, x, y)
#>      }
#>      y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#>          list(by$y))
#>      vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#>          suffix = suffix)
#>      x$ops <- op_double("join", x, y, args = list(vars = vars,
#>          type = type, by = by, suffix = suffix))
#>      x
#>  }
Browse[4]> f
#>  Joining, by = c("A", "D")
#>  exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  exiting from: left_join(dt_1, dt_2, na_matches = "na")
#>  # Source:   lazy query [?? x 4]
#>  # Database: sqlite 3.29.0 [:memory:]
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    NA

We see that left_join calls left_join.tbl_lazy on data tables with the na_matches = “na” option. However this is followed by a call to add_op_join the definition of which does not have any mention of na_matches.

Then, by contrast, on data frames:

left_join(df_1, df_2)
#>  debugging in: left_join(df_1, df_2)
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]> n
#>  debug: UseMethod("left_join")
Browse[2]>
#>  debugging in: left_join.tbl_df(df_1, df_2)
#>  debug: {
#>      check_valid_names(tbl_vars(x))
#>      check_valid_names(tbl_vars(y))
#>      by <- common_by(by, x, y)
#>      suffix <- check_suffix(suffix)
#>      na_matches <- check_na_matches(na_matches)
#>      y <- auto_copy(x, y, copy = copy)
#>      vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#>      by_x <- vars$idx$x$by
#>      by_y <- vars$idx$y$by
#>      aux_x <- vars$idx$x$aux
#>      aux_y <- vars$idx$y$aux
#>      out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#>          environment())
#>      names(out) <- vars$alias
#>      reconstruct_join(out, x, vars)
#>  }
Browse[3]>
#>  debug: check_valid_names(tbl_vars(x))
Browse[3]>
#>  debug: check_valid_names(tbl_vars(y))
Browse[3]>
#>  debug: by <- common_by(by, x, y)
Browse[3]>
#>  Joining, by = c("A", "D")
#>  debug: suffix <- check_suffix(suffix)
Browse[3]>
#>  debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#>  debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#>  [1] TRUE
Browse[3]> f
#>  exiting from: left_join.tbl_df(df_1, df_2)
#>  exiting from: left_join(df_1, df_2)
#>  # A tibble: 2 x 4
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    cc

We see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being used as argument in left_join_impl. All this makes sense.

When typing ?left_join.tbl_lazy the doc returns a local page for join.tbl_sql {dbplyr} which states for the unspecified arguments (the ):

“Other parameters passed onto methods, for instance, na_matches to control how NA values are matched. See join.tbl_df for more”.

Following the join.tbl_df doc link, it clearly mentions na_matches:

"Use 'never' to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, 'na', always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config('dplyr::na_matches' = 'never')".

So there seems to be some inconsistency between the doc and the code for data tables.

Also, @philipxy mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames". Now the class of dt_1 and df_1 are:

class(df_1)
#>  [1] "tbl_df"     "tbl"        "data.frame"
class(dt_1)
#>  [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"
#>  [4] "tbl_lazy"             "tbl"

I suppose the term "data frame" refers to classes data.frame and tbl_df, and what I called "data table" are the other tbl_*s including tbl_sql and tbl_lazy. So this news link also answers the question.

Still, I think the current documentation for the join verbs is confusing. It should clearly state:

"The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other choice) for data tables".

Hopefully, the choice na_matches = "na" will be implemented not too long in the future for data tables.

like image 74
Habert Avatar answered Oct 13 '22 18:10

Habert