R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table

Tags:

merge

r

left-join

data.table

dplyr

What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?

A common answer, such as in this thread, is to do the left outer join with dplyr::left_join, data.table's merge method, or data.table's dt2[dt1] keyed bracket syntax, and then, as a second step, replace all NA values with 0 in the joined data table. For example:

library(data.table)
dt1 <- data.table(x = c('a', 'b', 'c', 'd', 'e'), y = c(NA, 'w', NA, 'y', 'z'))
dt2 <- data.table(x = c('a', 'b', 'c'), new_col = c(1, 2, 3))
setkey(dt1, x)
setkey(dt2, x)
merged_tables <- dt2[dt1]                   # left outer join: keeps every row of dt1
merged_tables[is.na(merged_tables)] <- 0    # replaces every NA, including those from dt1

This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:

   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z

but the desired results are:

   x new_col y
1: a       1 NA
2: b       2 w
3: c       3 NA
4: d       0 y
5: e       0 z

In such a trivial case, instead of replacing every NA in the joined table as above, only the NA values in new_col could be replaced:

library(dplyr)
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col))

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.

There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:

merged_tables <- merge(dt1, dt2, by = "x", all.x = TRUE, fill = 0)  # hypothetical fill argument

data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.
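For reference, here is a minimal sketch of the dcast behaviour mentioned above. The table `long` and its columns `id`, `k` and `v` are made up for illustration; only `dcast` and its `fill` argument come from data.table.

```r
library(data.table)

# Hypothetical long-format data: the combination id = "b", k = "q" is missing
long <- data.table(
  id = c("a", "a", "b"),
  k  = c("p", "q", "p"),
  v  = c(1, 2, 3)
)

# Reshape to wide; cells with no matching row get `fill` instead of NA
wide <- dcast(long, id ~ k, value.var = "v", fill = 0)
print(wide)
```

The missing cell for `id == "b"` in column `q` comes out as 0 rather than NA, which is the kind of control the join functions do not offer directly.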

Suggestions?

EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHub page to do exactly what I just mentioned, updating the nomatch=0 syntax. It should be in the next release of data.table.

asked Feb 03 '16 20:02

Mekki MacAulay

1 Answer

I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)

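library(dplyr)  # left_join()
library(tidyr)  # replace_na()

left_join0 <- function(x, y, fill = 0L, ...) {
  z <- left_join(x, y, ...)
  # Columns that came from y, i.e. those not already present in x
  new_cols <- setdiff(names(z), names(x))
  # Fill NAs only in those columns; NAs in x's own columns are left alone
  replace_na(z, setNames(as.list(rep(fill, length(new_cols))), new_cols))
}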
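As a usage sketch, here is the function applied to the data from the question, rebuilt as tibbles (the names `df1`, `df2` and `res` are just for illustration). The function is repeated so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

left_join0 <- function(x, y, fill = 0L, ...) {
  z <- left_join(x, y, ...)
  new_cols <- setdiff(names(z), names(x))
  replace_na(z, setNames(as.list(rep(fill, length(new_cols))), new_cols))
}

# Same data as in the question, as tibbles instead of data.tables
df1 <- tibble(x = c("a", "b", "c", "d", "e"), y = c(NA, "w", NA, "y", "z"))
df2 <- tibble(x = c("a", "b", "c"), new_col = c(1, 2, 3))

# fill = 0 (double) matches the type of new_col
res <- left_join0(df1, df2, by = "x", fill = 0)
print(res)
```

Rows d and e should get 0 in new_col, while the NA values in y for rows a and c stay as they are. The fill value needs to be compatible with the type of every newly joined column, so as far as I can tell a single numeric fill would not work if some of the new columns are character.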

answered Oct 13 '22 05:10

Fernando Macedo
