Summary: I would like to merge two tables by a shared id key with all=TRUE (full outer join) where, instead of columns with the same names being set as var1.x, var2.y, etc., they are merged as a single column where missing (NA) values in the left table are filled in by values from the right table (in addition to the standard behavior of merge, i.e., appending rows with distinct ids and columns with distinct names).
Details:
I would like to merge + update table1 with table2 based on a shared id key column such that:
1) If table1 and table2 have columns with the same name (other than id), the value in table1 is left alone if it exists and replaced by the value in table2 if the value in table1 is NA.
2) If table2 has columns that table1 does not have (different names), they are merged (by id).
3) If table1 has an id that does not match in table2, values for different name columns from table2 are NA.
4) If table2 has an id that does not match in table1, it is added as a new row and the values for the different column names from table1 are NA.
3 & 4 are as with standard merge with all=TRUE.
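For reference, here is a minimal sketch of the default behavior the summary refers to. The tables left and right and their column names are hypothetical, made up only to show how a plain merge with all=TRUE names a shared non-key column:

```r
library(data.table)

# Hypothetical miniature tables, used only to show the default column naming.
left  <- data.table(id = c("a", "b"), v = c(1, NA), only_left = c(10, 20))
right <- data.table(id = c("b", "c"), v = c(2, 3), only_right = c("x", "y"))

# Full outer join: every id from either table appears once.
plain <- merge(left, right, by = "id", all = TRUE)

# The shared column v is split into v.x (from left) and v.y (from right)
# rather than being combined into a single v column.
names(plain)
```

Rules 2 to 4 are already satisfied by this result; only rule 1 (filling NA in v.x from v.y) needs extra work.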
I'm concerned that I have overthought the problem, as I cannot find a straightforward way to do this with a merge or join that doesn't involve creating ifelse checks on every column. The real data has ~1000 columns, so doing ifelse lookups on each one would make for an incredibly long solution.
Reproducible reduced example:
table1 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id6"),
                     var1 = c(1, 2, 3, 4, 5, 6),
                     var2 = c("a", "b", NA, "d", NA, "f"),
                     var3 = c(NA, 12, 13, 14, 15, 16))
table2 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id8"),
                     var1 = c(1, 2, NA, 4, 5, 8),
                     var2 = c(NA, "b", "c", "d", "e", "h"),
                     var4 = c("foo", "bar", "oof", "rab", NA, "sna"))
desired <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id6", "id8"),
                      var1 = c(1, 2, 3, 4, 5, 6, 8),
                      var2 = c("a", "b", "c", "d", "e", "f", "h"),
                      var3 = c(NA, 12, 13, 14, 15, 16, NA),
                      var4 = c("foo", "bar", "oof", "rab", NA, NA, "sna"))
table1
    id var1 var2 var3
1: id1    1    a   NA
2: id2    2    b   12
3: id3    3   NA   13
4: id4    4    d   14
5: id5    5   NA   15
6: id6    6    f   16
table2
    id var1 var2 var4
1: id1    1   NA  foo
2: id2    2    b  bar
3: id3   NA    c  oof
4: id4    4    d  rab
5: id5    5    e   NA
6: id8    8    h  sna
desired
    id var1 var2 var3 var4
1: id1    1    a   NA  foo
2: id2    2    b   12  bar
3: id3    3    c   13  oof
4: id4    4    d   14  rab
5: id5    5    e   15   NA
6: id6    6    f   16   NA
7: id8    8    h   NA  sna
Explanation of desired output:
For column var1, table1 had all the values, so it is left alone and the NA for id3 in table2 is ignored (note that this doesn't include the row merge for different ids described below).
For column var2, table1 was missing the value indexed by id3, so it is updated from table2 (note that this doesn't include the row merge for different ids described below).
For column var3, there is no matching column in table2, so it is kept as is.
For column var4, there was no column var4 in table1, so it is merged from table2 by the id key variable.
For the row with id6 in table1, there is no matching id6 in table2, so the value for column var4, which is only in table2, is NA in the desired output for row id6.
For the row with id8 in table2, there is no matching id8 in table1, so the value for column var3, which is only in table1, is NA in the desired output for row id8.
Surely there is a straightforward way to do this with data.table? Efficient solutions are particularly welcome given the size of the real data. The datamerge package apparently used to do exactly this, but it isn't on CRAN anymore and I can't get it to work on R 3.2.3 from zip. Has another package stepped up for this task? There are many other threads that focus on solving this for one or a couple of columns with known names, but for a large number of columns they don't seem practical.
Here's one way:
com.cols = setdiff(intersect(names(table1), names(table2)), "id")
com.cols.x = paste0(com.cols, ".x")
com.cols.y = paste0(com.cols, ".y")
# create combined table
DT = setkey(merge(table1, table2, by="id", all=TRUE), NULL)
# edit common columns where NAs are present
for (j in seq_along(com.cols))
  DT[is.na(get(com.cols.x[j])), (com.cols.x[j]) := get(com.cols.y[j])]
# remove unneeded columns
DT[, (com.cols.y) := NULL]
# rename kept columns
setnames(DT, com.cols.x, com.cols)
identical(DT, desired) # TRUE
It's rather messy to create and work with all these column-names vectors.
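As a possible tidier variant of the same steps, the sketch below wraps them in a helper and uses fcoalesce to fill the NAs. This is not from the thread: it assumes data.table 1.12.4 or later (where fcoalesce is available), that shared columns have the same type in both tables, and that no column already ends in .x or .y. The function name merge_fill is made up for illustration.

```r
library(data.table)

table1 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id6"),
                     var1 = c(1, 2, 3, 4, 5, 6),
                     var2 = c("a", "b", NA, "d", NA, "f"),
                     var3 = c(NA, 12, 13, 14, 15, 16))
table2 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id8"),
                     var1 = c(1, 2, NA, 4, 5, 8),
                     var2 = c(NA, "b", "c", "d", "e", "h"),
                     var4 = c("foo", "bar", "oof", "rab", NA, "sna"))

# merge_fill is a hypothetical helper name, not part of data.table.
merge_fill <- function(x, y, key = "id") {
  common <- setdiff(intersect(names(x), names(y)), key)
  x_cols <- paste0(common, ".x")
  y_cols <- paste0(common, ".y")

  # Full outer join; shared columns come back as <name>.x and <name>.y.
  out <- merge(x, y, by = key, all = TRUE)

  if (length(common) > 0) {
    # Keep the value from x where present, otherwise take the one from y.
    for (k in seq_along(common)) {
      set(out, j = x_cols[k],
          value = fcoalesce(out[[x_cols[k]]], out[[y_cols[k]]]))
    }
    out[, (y_cols) := NULL]          # drop the now-redundant .y columns
    setnames(out, x_cols, common)    # restore the original names
  }
  out
}

result <- merge_fill(table1, table2)
result
```

The logic is the same as the loop above (merge, fill, drop, rename); the column-name vectors are just hidden inside the function, and neither input table is modified.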
Regarding the original question...
Here's another way (without importing new rows from table2, as in the original post):
com.cols = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols = paste0("i.", com.cols)
new.cols = c(i.com.cols, setdiff(names(table2), c("id", com.cols)))
# grab columns from table2
table1[table2, (new.cols) := mget(new.cols), on="id"]
# edit common columns where NAs are present
for (j in seq_along(com.cols))
  table1[is.na(get(com.cols[j])), (com.cols[j]) := get(i.com.cols[j])]
# remove unneeded columns
table1[, (i.com.cols) := NULL]
This way all steps are modifications of table1 by reference.
Here's another option that avoids explicitly adding the i. columns to the original table:
com.cols = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols = paste0("i.", com.cols)
# I'm using the same var names as Frank, but new.cols is strictly the new ones here
new.cols = setdiff(names(table2), names(table1))
# this is easy - the previously absent cols
table1[table2, (new.cols) := mget(new.cols), on = 'id']
# now for the ones that need updating
table1[table2, on = 'id',
       (com.cols) := Map(function(col, i.col) pmin(col, i.col, na.rm = TRUE),
                         mget(com.cols), mget(i.com.cols))]
I have no idea which option is faster - OP can check that.
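One thing that may be worth checking before relying on the pmin version: pmin keeps the smaller of two non-missing values, so it only matches the "keep table1's value" rule when both tables agree wherever both have a value (as they do in the example data). A small base-R sketch with made-up vectors left and right shows the difference:

```r
# Hypothetical values: position 1 is present in both vectors but differs.
left  <- c(5, NA, 2)
right <- c(3, 7, NA)

# pmin picks the smaller non-missing value at each position.
by_pmin <- pmin(left, right, na.rm = TRUE)

# Filling only where left is NA always keeps left's existing values.
by_fill <- ifelse(is.na(left), right, left)

by_pmin
by_fill
```

The two results differ only at the first position, where both inputs are present and disagree.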