What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?
A common answer, such as in this thread is to do the left outer join with either dplyr::left_join
or data.table::merge
or data.table
's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA
values by 0
in the joined data table. For example:
library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;
This approach necessarily assumes that there are no valid NA values in dt1
that need to be preserved. Yet, as you can see in the above example, the results are:
x new_col y
1: a 1 0
2: b 2 w
3: c 3 0
4: d 0 y
5: e 0 z
but the desired results are:
x new_col y
1: a 1 NA
2: b 2 w
3: c 3 NA
4: d 0 y
5: e 0 z
In such a trivial case, instead of using the data.table
all elements replace syntax as above, just the NA values in new_col
could be replaced:
library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));
However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join
, data.table::merge
, or data.table
's bracket easily allowed the user to specify a fill
value other than NA. Something like:
merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);
data.table
's dcast
function allows the user to specify fill
value, so I figure there must be an easier way to do this that I'm just not thinking of.
Suggestions?
EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table
GitHug page to do exactly what I just mentioned, updating the nomatch=0
syntax. Should be in the next release of data.table
.
You can replace NA values with zero(0) on numeric columns of R data frame by using is.na() , replace() , imputeTS::replace() , dplyr::coalesce() , dplyr::mutate_at() , dplyr::mutate_if() , and tidyr::replace_na() functions.
The syntax for the LEFT OUTER JOIN in MySQL is: SELECT columns FROM table1 LEFT [OUTER] JOIN table2 ON table1.
A left outer join is a method of combining tables. The result includes unmatched rows from only the table that is specified before the LEFT OUTER JOIN clause. If you are joining two tables and want the result set to include unmatched rows from only one table, use a LEFT OUTER JOIN clause or a RIGHT OUTER JOIN clause.
A LEFT OUTER JOIN performs an inner join of two tables (supposed table A which writes before the join keyword and table B which writes after the join keyword in the SQL statement ) based on the condition specified after the ON keyword. It returns all rows from the table A as well as the unmatched rows from the table B.
I stumbled on the same problem with dplyr and wrote a small function that solved my problem. (the solution requires tidyr and dplyr)
left_join0 <- function(x, y, fill = 0L, ...){
z <- left_join(x, y, ...)
new_cols <- setdiff(names(z), names(x))
z <- replace_na(z, setNames(as.list(rep(fill, length(new_cols))), new_cols))
z
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With