<code>left_join</code> works as expected with NA values on tibbles or data frames, but on tbl it seems it does not match NAs, even with the option na_matches = "na". <h3>R version and package versions</h3> <pre class="prettyprint"><code>> sessionInfo() R version 3.6.1 (2019-07-05) Platform: x86_64-apple-darwin18.6.0 (64-bit) Running under: macOS Mojave 10.14.6 ... other attached packages: [1] reprex_0.3.0 dbplyr_1.4.2 lubridate_1.7.4 magrittr_1.5 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2 readr_1.3.1 [10] tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1 ... </code></pre> The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB). <h3>Minimal reprex.</h3> (1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl. <pre class="prettyprint"><code>library(tidyverse) con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA)) df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA)) copy_to(con, df_1, overwrite = T) copy_to(con, df_2, overwrite = T) dt_1 <- tbl(con, "df_1") dt_2 <- tbl(con, "df_2") df_1 #> # A tibble: 2 x 3 #> A B D #> <chr> <chr> <chr> #> 1 a b d #> 2 aa bb <NA> df_2 #> # A tibble: 2 x 3 #> A C D #> <chr> <chr> <chr> #> 1 a c d #> 2 aa cc <NA> dt_1 #> # Source: table<df_1> [?? x 3] #> # Database: sqlite 3.29.0 [:memory:] #> A B D #> <chr> <chr> <chr> #> 1 a b d #> 2 aa bb <NA> dt_2 #> # Source: table<df_2> [?? x 3] #> # Database: sqlite 3.29.0 [:memory:] #> A C D #> <chr> <chr> <chr> #> 1 a c d #> 2 aa cc <NA> </code></pre> (2) Then I use <code>left_join</code>, first on the data frames, then on the tbls: <pre class="prettyprint"><code>left_join(df_1, df_2) #> Joining, by = c("A", "D") #> # A tibble: 2 x 4 #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb <NA> cc left_join(dt_1, dt_2, na_matches = "na") #> Joining, by = c("A", "D") #> # Source: lazy query [?? x 4] #> # Database: sqlite 3.29.0 [:memory:] #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb <NA> <NA> </code></pre> We can see that the second row last column <code>C</code> has the expected <code>cc</code> in the case of data frames (by default <code>na_matches = "na"</code>) but <code><NA></code> in the case of tbl even with the explicit option <code>na_matches = "na"</code> (which is the default according to the doc). This is unexpected. <h3>EDIT</h3> Notice that this is the same result as in the case of data frames with <code>na_matches = "never"</code>: <pre class="prettyprint"><code> left_join(df_1, df_2, na_matches = "never") #> Joining, by = c("A", "D") #> # A tibble: 2 x 4 #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb <NA> <NA> </code></pre> BTW, the title mentions <code>left_join</code> because it's the most common join, but same problem arises with <code>inner_join</code> (<code>full_join</code> is not yet out for data tables), perhaps even more visibly if we keep the <code>na_matches = "na"</code> in both: <pre class="prettyprint"><code>inner_join(dt_1, dt_2, na_matches = "na") #> Joining, by = c("A", "D") #> # Source: lazy query [?? x 4] #> # Database: sqlite 3.29.0 [:memory:] #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c inner_join(df_1, df_2, na_matches = "na") #> Joining, by = c("A", "D") #> # A tibble: 2 x 4 #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb <NA> cc </code></pre>

To respond to @philipxy ’s request to dig a little further in the left_join process I went on debug mode for <code>left_join</code>, first on data tables: <pre class="prettyprint"><code>debug(left_join) left_join(dt_1, dt_2, na_matches = "na") #> debugging in: left_join(dt_1, dt_2, na_matches = "na") #> debug: { #> UseMethod("left_join") #> } Browse[2]> n #> debug: UseMethod("left_join") #> Browse[2]> n #> debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na") #> debug: { #> add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, #> suffix = suffix, auto_index = auto_index, ...) #> } Browse[3]> #> debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, #> suffix = suffix, auto_index = auto_index, ...) Browse[3]> s #> debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, #> suffix = suffix, auto_index = auto_index, ...) #> debug: { #> if (!is.null(sql_on)) { #> by <- list(x = character(0), y = character(0), on = sql(sql_on)) #> } #> else if (identical(type, "full") && identical(by, character())) { #> type <- "cross" #> by <- list(x = character(0), y = character(0)) #> } #> else { #> by <- common_by(by, x, y) #> } #> y <- auto_copy(x, y, copy = copy, indexes = if (auto_index) #> list(by$y)) #> vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by, #> suffix = suffix) #> x$ops <- op_double("join", x, y, args = list(vars = vars, #> type = type, by = by, suffix = suffix)) #> x #> } Browse[4]> f #> Joining, by = c("A", "D") #> exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, #> suffix = suffix, auto_index = auto_index, ...) #> exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na") #> exiting from: left_join(dt_1, dt_2, na_matches = "na") #> # Source: lazy query [?? x 4] #> # Database: sqlite 3.29.0 [:memory:] #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb NA NA </code></pre> We see that <code>left_join</code> calls <code>left_join.tbl_lazy</code> on data tables with the <code>na_matches = “na”</code> option. However this is followed by a call to <code>add_op_join</code> the definition of which does not have any mention of <code>na_matches</code>. Then, by contrast, on data frames: <pre class="prettyprint"><code>left_join(df_1, df_2) #> debugging in: left_join(df_1, df_2) #> debug: { #> UseMethod("left_join") #> } Browse[2]> n #> debug: UseMethod("left_join") Browse[2]> #> debugging in: left_join.tbl_df(df_1, df_2) #> debug: { #> check_valid_names(tbl_vars(x)) #> check_valid_names(tbl_vars(y)) #> by <- common_by(by, x, y) #> suffix <- check_suffix(suffix) #> na_matches <- check_na_matches(na_matches) #> y <- auto_copy(x, y, copy = copy) #> vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix) #> by_x <- vars$idx$x$by #> by_y <- vars$idx$y$by #> aux_x <- vars$idx$x$aux #> aux_y <- vars$idx$y$aux #> out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches, #> environment()) #> names(out) <- vars$alias #> reconstruct_join(out, x, vars) #> } Browse[3]> #> debug: check_valid_names(tbl_vars(x)) Browse[3]> #> debug: check_valid_names(tbl_vars(y)) Browse[3]> #> debug: by <- common_by(by, x, y) Browse[3]> #> Joining, by = c("A", "D") #> debug: suffix <- check_suffix(suffix) Browse[3]> #> debug: na_matches <- check_na_matches(na_matches) Browse[3]> #> debug: y <- auto_copy(x, y, copy = copy) Browse[3]> na_matches #> [1] TRUE Browse[3]> f #> exiting from: left_join.tbl_df(df_1, df_2) #> exiting from: left_join(df_1, df_2) #> # A tibble: 2 x 4 #> A B D C #> <chr> <chr> <chr> <chr> #> 1 a b d c #> 2 aa bb NA cc </code></pre> We see that <code>left_join</code> calls <code>left_join.tbl_df</code> on data frames. Further down we see that <code>na_matches</code> is set to <code>TRUE</code> before being used as argument in <code>left_join_impl</code>. All this makes sense. When typing <code>?left_join.tbl_lazy</code> the doc returns a local page for <code>join.tbl_sql {dbplyr}</code> which states for the unspecified arguments (the <code>…</code>): “Other parameters passed onto methods, for instance, na_matches to control how NA values are matched. See <code>join.tbl_df</code> for more”. Following the <code>join.tbl_df</code> doc link, it clearly mentions <code>na_matches</code>: "Use 'never' to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, 'na', always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config('dplyr::na_matches' = 'never')". So there seems to be some inconsistency between the doc and the code for data tables. Also, @philipxy mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames". Now the class of dt_1 and df_1 are: <pre class="prettyprint"><code>class(df_1) #> [1] "tbl_df" "tbl" "data.frame" class(dt_1) #> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" #> [4] "tbl_lazy" "tbl" </code></pre> I suppose the term "data frame" refers to classes <code>data.frame</code> and <code>tbl_df</code>, and what I called "data table" are the other <code>tbl_*</code>s including <code>tbl_sql</code> and <code>tbl_lazy</code>. So this news link also answers the question. Still, I think the current documentation for the join verbs is confusing. It should clearly state: "The defaults is <code>na_matches = 'na'</code> for data frames and <code>na_matches = 'never'</code> (with no other choice) for data tables". Hopefully, the choice <code>na_matches = "na"</code> will be implemented not too long in the future for data tables.

left_join for tbl: na_matches not working

R version and package versions

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
...
other attached packages:
 [1] reprex_0.3.0    dbplyr_1.4.2    lubridate_1.7.4 magrittr_1.5    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2     readr_1.3.1
[10] tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.0   tidyverse_1.2.1
...

The following is a reprex with SQLite, but the same happens with PostgreSQL (I actually stumbled upon that problem with a PostgreSQL DB).

Minimal reprex.

(1) I create 2 data frames, copy them locally to SQLite DB, then load them again as tbl.

library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")

df_1
#> # A tibble: 2 x 3
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

df_2
#> # A tibble: 2 x 3
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

dt_1
#> # Source:   table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

dt_2
#> # Source:   table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

(2) Then I use left_join, first on the data frames, then on the tbls:

left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc

left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na") but <NA> in the case of tbl even with the explicit option na_matches = "na" (which is the default according to the doc). This is unexpected.

EDIT

Notice that this is the same result as in the case of data frames with na_matches = "never":


left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

BTW, the title mentions left_join because it's the most common join, but same problem arises with inner_join (full_join is not yet out for data tables), perhaps even more visibly if we keep the na_matches = "na" in both:

inner_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c
inner_join(df_1, df_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc

405

asked Aug 31 '19 02:08

Habert

1 Answers

To respond to @philipxy ’s request to dig a little further in the left_join process I went on debug mode for left_join, first on data tables:

debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#>  debugging in: left_join(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]>  n
#>  debug: UseMethod("left_join")
#>  Browse[2]> n
#>  debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  debug: {
#>      add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>          suffix = suffix, auto_index = auto_index, ...)
#>  }
Browse[3]>
#>  debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#>  debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  debug: {
#>      if (!is.null(sql_on)) {
#>         by <- list(x = character(0), y = character(0), on = sql(sql_on))
#>      }
#>      else if (identical(type, "full") && identical(by, character())) {
#>          type <- "cross"
#>          by <- list(x = character(0), y = character(0))
#>      }
#>      else {
#>          by <- common_by(by, x, y)
#>      }
#>      y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#>          list(by$y))
#>      vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#>          suffix = suffix)
#>      x$ops <- op_double("join", x, y, args = list(vars = vars,
#>          type = type, by = by, suffix = suffix))
#>      x
#>  }
Browse[4]> f
#>  Joining, by = c("A", "D")
#>  exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>      suffix = suffix, auto_index = auto_index, ...)
#>  exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#>  exiting from: left_join(dt_1, dt_2, na_matches = "na")
#>  # Source:   lazy query [?? x 4]
#>  # Database: sqlite 3.29.0 [:memory:]
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    NA

We see that left_join calls left_join.tbl_lazy on data tables with the na_matches = “na” option. However this is followed by a call to add_op_join the definition of which does not have any mention of na_matches.

Then, by contrast, on data frames:

left_join(df_1, df_2)
#>  debugging in: left_join(df_1, df_2)
#>  debug: {
#>      UseMethod("left_join")
#>  }
Browse[2]> n
#>  debug: UseMethod("left_join")
Browse[2]>
#>  debugging in: left_join.tbl_df(df_1, df_2)
#>  debug: {
#>      check_valid_names(tbl_vars(x))
#>      check_valid_names(tbl_vars(y))
#>      by <- common_by(by, x, y)
#>      suffix <- check_suffix(suffix)
#>      na_matches <- check_na_matches(na_matches)
#>      y <- auto_copy(x, y, copy = copy)
#>      vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#>      by_x <- vars$idx$x$by
#>      by_y <- vars$idx$y$by
#>      aux_x <- vars$idx$x$aux
#>      aux_y <- vars$idx$y$aux
#>      out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#>          environment())
#>      names(out) <- vars$alias
#>      reconstruct_join(out, x, vars)
#>  }
Browse[3]>
#>  debug: check_valid_names(tbl_vars(x))
Browse[3]>
#>  debug: check_valid_names(tbl_vars(y))
Browse[3]>
#>  debug: by <- common_by(by, x, y)
Browse[3]>
#>  Joining, by = c("A", "D")
#>  debug: suffix <- check_suffix(suffix)
Browse[3]>
#>  debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#>  debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#>  [1] TRUE
Browse[3]> f
#>  exiting from: left_join.tbl_df(df_1, df_2)
#>  exiting from: left_join(df_1, df_2)
#>  # A tibble: 2 x 4
#>    A     B     D     C
#>    <chr> <chr> <chr> <chr>
#>  1 a     b     d     c
#>  2 aa    bb    NA    cc

We see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being used as argument in left_join_impl. All this makes sense.

When typing ?left_join.tbl_lazy the doc returns a local page for join.tbl_sql {dbplyr} which states for the unspecified arguments (the …):

“Other parameters passed onto methods, for instance, na_matches to control how NA values are matched. See join.tbl_df for more”.

Following the join.tbl_df doc link, it clearly mentions na_matches:

"Use 'never' to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, 'na', always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config('dplyr::na_matches' = 'never')".

So there seems to be some inconsistency between the doc and the code for data tables.

Also, @philipxy mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames". Now the class of dt_1 and df_1 are:

class(df_1)
#>  [1] "tbl_df"     "tbl"        "data.frame"
class(dt_1)
#>  [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"
#>  [4] "tbl_lazy"             "tbl"

I suppose the term "data frame" refers to classes data.frame and tbl_df, and what I called "data table" are the other tbl_*s including tbl_sql and tbl_lazy. So this news link also answers the question.

Still, I think the current documentation for the join verbs is confusing. It should clearly state:

"The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other choice) for data tables".

Hopefully, the choice na_matches = "na" will be implemented not too long in the future for data tables.

answered Oct 13 '22 18:10

Habert

Related questions
                            
                                Font family won't change in ggplot [duplicate]
                            
                                Facet function in highcharts
                            
                                Valgrind/R is not working: "Fatal error: cannot create 'R_TempDir'"
                            
                                rmarkdown::render problem when called from a package
                            
                                Get PID for subprocesses for asynchronous futures in R shiny
                            
                                Change axis text direction to right-to-left
                            
                                CRAN-acceptable way of linking to OpenMP some C code called from Rcpp
                            
                                Add a particular version of R to a docker container
                            
                                image as axis tick ggplot
                            
                                How to find the joint cumulative distribution function from a 2-D copula in R?
                            
                                Discrepancies between R optim vs Scipy optimize: Nelder-Mead
                            
                                Transparent lookup table for numeric values without using data.frame?
                            
                                Conditionally include chapters in Bookdown
                            
                                Adding a horizontal line to a plotly bar graph
                            
                                How to update the leaflet map in the selectModUI in a Shiny app?
                            
                                Splitting strings into number and string (with missings)
                            
                                fitting first order equation with nlme and lsoda
                            
                                How can I make this histogram in ggplot / R?
                            
                                compact/efficient replacement for diag(X V X^T)?
                            
                                Unexpected behaviour when piping to ls() in R?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

left_join for tbl: na_matches not working

Tags:

sqlite

r

left-join

dplyr

dbplyr

R version and package versions

Minimal reprex.

EDIT

Habert

People also ask

1 Answers

Habert

Recent Activity

Donate For Us