Suppose I have 2 data.frames that I want to merge. Data as follows:
options(scipen = 999)
LHS <- structure(
list(
v1 = c(1, 2, 3, 4, 5, 6),
v2 = c("one", "one", "one", "one", "one", "one"),
v3 = c("two", "two", "two", "two", "two", "two"),
v4 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_),
v5 = c("xx", "zz", NA_character_, "yy", NA_character_, NA_character_)),
.Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character", "character","character","character"),
row.names = c(NA, -6L),
class = "data.frame"
) # end of LHS `structure`
RHS <- structure(
list(
v1 = c(3, 5, 6, 10, 100, 1000, 10000, 100000),
v2 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY"),
v3 = c("QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY", "QWERTY")),
.Names = c("id", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character","character"),
row.names = c(NA, -8L),
class = "data.frame"
) # end of RHS `structure`
In LHS data.frame some rows with id 3, 5 and 6 have missing values. So I want to use left_join to combine my LHS and RHS, however, I am interested only in rows with missing values in variables of interest. In other words, I want to implement left_join to rows with missing values in particular columns (e.g. var of interest 1 and var of interest 2). Note, that RHS data.frames (I would have more than 1) do not contain data for rows (id) that already present in LHS data.frame.
The only one solution I figured out was to use lapply over all rows and implement left_join for those which contain missing values in particular columns (id 3, 5 and 6). lapply could be upgraded to mclapply to speed up execution time, however, for long data.frames such operation could be pricy.
Another one possible solution is to divide data.frames into two: with and without missing values, perform left_join and combine them together.
I am looking for more gentle solution. Any ides?
UPDATED desired output:
structure(
list(
v1 = c(1, 2, 3, 4, 5, 6),
v2 = c("one", "one", "one", "one", "one", "one"),
v3 = c("two", "two", "two", "two", "two", "two"),
v4 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY"),
v5 = c("xx", "zz", "QWERTY", "yy", "QWERTY", "QWERTY")),
.Names = c("id", "first var","second var", "var of interest 1", "var of interest 2"),
.typeOf = c("numeric", "character", "character","character","character"),
row.names = c(NA, -6L),
class = "data.frame"
)
data.table update join would be helpful here.
library(data.table)
setDT(LHS)
setDT(RHS)
cols <- c("var of interest 1", "var of interest 2")
LHS[RHS, on = .(id), (cols) := mget(paste0('i.', cols))]
LHS
# id first var second var var of interest 1 var of interest 2
#1: 1 one two xx xx
#2: 2 one two zz zz
#3: 3 one two QWERTY QWERTY
#4: 4 one two yy yy
#5: 5 one two QWERTY QWERTY
#6: 6 one two QWERTY QWERTY
The following tidyverse way may also help in doing as expected for any number of vars of interest
LHS %>% left_join(RHS, by = 'id') %>%
mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
select(!(ends_with('.y')))
id first var second var var of interest 1.x var of interest 2.x
1 1 one two xx xx
2 2 one two zz zz
3 3 one two QWERTY QWERTY
4 4 one two yy yy
5 5 one two QWERTY QWERTY
6 6 one two QWERTY QWERTY
If you want to rename back your cols, do this
LHS %>% left_join(RHS, by = 'id') %>%
mutate(across(ends_with('.x'), ~coalesce(., get(str_replace(cur_column(), '.x', '.y'))))) %>%
select(!(ends_with('.y'))) %>%
rename_with(~str_remove(., '.x'), ends_with('.x'))
id first var second var var of interest 1 var of interest 2
1 1 one two xx xx
2 2 one two zz zz
3 3 one two QWERTY QWERTY
4 4 one two yy yy
5 5 one two QWERTY QWERTY
6 6 one two QWERTY QWERTY
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With