
R: Efficient Way to Merge+Update Table With Second Table Where Values from Same Column Names Fill NAs

Summary: I would like to merge two tables on a shared id key with all=TRUE (a full outer join), but instead of same-named columns being split into var1.x, var1.y, etc., they should be merged into a single column in which missing (NA) values in the left table are filled in by values from the right table. This is in addition to the standard behavior of merge, i.e., appending rows with distinct ids and columns with distinct names.

Details:

I would like to merge + update table1 with table2 based on a shared id key column such that:

1) If table1 and table2 have columns with the same name (other than id), the value in table1 is left alone if it exists and replaced by the value in table2 if the value in table1 is NA.

2) If table2 has columns that table1 does not have (different names), they are merged (by id).

3) If table1 has an id that does not match in table2, the values for the columns that exist only in table2 are NA.

4) If table2 has an id that does not match in table1, it is added as a new row and the values for the columns that exist only in table1 are NA.

3 & 4 are as with standard merge with all=TRUE.

I'm concerned that I have overthought the problem, as I cannot find a straightforward way to do this with a merge or join that doesn't involve an ifelse check on every column. The real data has ~1000 columns, so writing an ifelse lookup for each one would make for an incredibly long solution.

Reproducible reduced example:

table1  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id6"),
                      var1=c(1,2,3,4,5, 6),
                      var2=c("a", "b", NA, "d", NA, "f"),
                      var3=c(NA, 12, 13, 14, 15, 16));

table2  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id8"),
                      var1=c(1,2,NA,4,5, 8),
                      var2=c(NA, "b", "c", "d", "e", "h"),
                      var4=c("foo", "bar", "oof", "rab", NA, "sna"));

desired <- data.table(id=c("id1", "id2", "id3", "id4", "id5", "id6", "id8"),
                      var1=c(1,2,3,4,5, 6, 8),
                      var2=c("a", "b", "c", "d", "e", "f", "h"),
                      var3=c(NA, 12, 13, 14, 15, 16, NA),
                      var4=c("foo", "bar", "oof", "rab", NA, NA, "sna"));

table1;
    id var1 var2 var3
1: id1    1    a   NA
2: id2    2    b   12
3: id3    3   NA   13
4: id4    4    d   14
5: id5    5   NA   15
6: id6    6    f   16

table2;
    id var1 var2 var4
1: id1    1   NA  foo
2: id2    2    b  bar
3: id3   NA    c  oof
4: id4    4    d  rab
5: id5    5    e   NA
6: id8    8    h  sna

desired
    id var1 var2 var3 var4
1: id1    1    a   NA  foo
2: id2    2    b   12  bar
3: id3    3    c   13  oof
4: id4    4    d   14  rab
5: id5    5    e   15   NA
6: id6    6    f   16   NA
7: id8    8    h   NA  sna

Explanation of desired output:

  1. For column var1, table1 had all the values, so it is left alone and the NA for id3 in table2 is ignored (note that this doesn't include the row merge for different ids described below).

  2. For column var2, table1 was missing the values indexed by id3 and id5 (see the definition of table1 above), so they are filled in from table2 (note that this doesn't include the row merge for different ids described below).

  3. For column var3, there is no matching column in table2, so it is kept as is.

  4. For column var4, there was no column var4 in table1, so it is merged from table2 by id key variable.

  5. For row with id6 in table1, there is no matching id6 in table2, so the value for column var4 that is only in table2 is NA in the desired output for row id6.

  6. For row with id8 in table2 there is no matching id8 in table1, so the value for column var3 that is only in table1 is NA in the desired output for row id8.

Surely there is a straightforward way to do this with data.table? Efficient solutions are particularly welcome given the size of the real data. The datamerge package apparently used to do exactly this, but it isn't on CRAN anymore and I can't get it to work on R 3.2.3 from the zip. Has another package stepped up for this task? There are many other threads that focus on solving this for one or a couple of columns with known names, but for a large number of columns they don't seem practical.

Mekki MacAulay asked Apr 01 '16 19:04




2 Answers

Here's one way:

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
com.cols.x  = paste0(com.cols, ".x")
com.cols.y  = paste0(com.cols, ".y")

# create combined table
DT = setkey(merge(table1, table2, by="id", all=TRUE), NULL)

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  DT[is.na(get(com.cols.x[j])), (com.cols.x[j]) := get(com.cols.y[j])]

# remove unneeded columns
DT[, (com.cols.y) := NULL]

# rename kept columns
setnames(DT, com.cols.x, com.cols)

identical(DT, desired) # TRUE

It's rather messy to create and work with all these column-names vectors.
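One way to contain that bookkeeping is to wrap the steps in a small function and let fcoalesce() do the NA filling. The sketch below is not part of the original answer: merge_fill is a hypothetical helper name, and it assumes a data.table version recent enough to provide fcoalesce() and that each shared column has the same type in both tables.

```r
library(data.table)

table1 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id6"),
                     var1 = c(1, 2, 3, 4, 5, 6),
                     var2 = c("a", "b", NA, "d", NA, "f"),
                     var3 = c(NA, 12, 13, 14, 15, 16))

table2 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id8"),
                     var1 = c(1, 2, NA, 4, 5, 8),
                     var2 = c(NA, "b", "c", "d", "e", "h"),
                     var4 = c("foo", "bar", "oof", "rab", NA, "sna"))

# Hypothetical helper: full outer join on `by`, then fill NAs in the
# left table's shared columns with the right table's values.
merge_fill <- function(x, y, by = "id") {
  com.cols <- setdiff(intersect(names(x), names(y)), by)
  DT <- merge(x, y, by = by, all = TRUE, suffixes = c(".x", ".y"))
  for (col in com.cols) {
    x.col <- paste0(col, ".x")
    y.col <- paste0(col, ".y")
    # fcoalesce() keeps the first non-NA value, element by element;
    # it assumes both columns have the same type
    set(DT, j = x.col, value = fcoalesce(DT[[x.col]], DT[[y.col]]))
    DT[, (y.col) := NULL]     # drop the right-hand copy
    setnames(DT, x.col, col)  # restore the original column name
  }
  setkey(DT, NULL)            # merge() keys the result; remove the key
  DT[]
}

res <- merge_fill(table1, table2)
```

Because the loop only touches columns whose names appear in both tables, the same call should work unchanged when there are many shared columns, as long as their types match.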


Regarding the original question...

Here's another way (without importing new rows from table2, as in the original post):

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols  = paste0("i.", com.cols)
new.cols    = c(i.com.cols, setdiff(names(table2), c("id", com.cols)))

# grab columns from table2
table1[table2, (new.cols) := mget(new.cols), on="id"]

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  table1[is.na(get(com.cols[j])), (com.cols[j]) := get(i.com.cols[j])]

# remove unneeded columns
table1[, (i.com.cols) := NULL]

This way all steps are modifications of table1 by reference.
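Since this version only updates rows that already exist in table1, ids that occur only in table2 (id8 in the example) are not added. If those rows are wanted, one possibility is to append them afterwards with an anti-join and rbind(..., fill = TRUE). The following is a minimal sketch on toy data; the two small tables and the names extra and combined are illustrative and not from the answer. Note that rbind() builds a new table rather than modifying table1 by reference.

```r
library(data.table)

# toy stand-ins for the tables in the question
table1 <- data.table(id = c("id1", "id6"), var1 = c(1, 6), var3 = c(NA, 16))
table2 <- data.table(id = c("id1", "id8"), var1 = c(1, 8), var4 = c("foo", "sna"))

# anti-join: rows of table2 whose id has no match in table1
extra <- table2[!table1, on = "id"]

# fill = TRUE pads columns that are missing on either side with NA
combined <- rbind(table1, extra, fill = TRUE)
```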

Frank answered Oct 16 '22 00:10



Here's another option that avoids explicitly adding the i. columns to the original table:

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols  = paste0("i.", com.cols)
# I'm using the same var names as Frank, but new.cols is strictly the new ones here
new.cols    = setdiff(names(table2), names(table1))

# this is easy - the previously absent cols
table1[table2, (new.cols) := mget(new.cols), on = 'id']

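# now for the ones that need updating
# note: pmin(..., na.rm = TRUE) fills NAs only when at most one side is non-NA
# or both sides agree; if both are present and differ, it keeps the smaller value
table1[table2, on = 'id',
       (com.cols) := Map(function(col, i.col) pmin(col, i.col, na.rm = TRUE),
                         mget(com.cols), mget(i.com.cols))]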

I have no idea which option is faster - OP can check that.
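One caveat when using pmin() as the fill step: it matches the "keep table1 unless it is NA" rule only when the two tables never disagree on a non-missing value. The small comparison below uses made-up vectors x and y (not from the question) and assumes a data.table version that provides fcoalesce(), which always prefers its first argument.

```r
library(data.table)

x <- c(5, NA, 3)   # stands in for a column of table1
y <- c(2, 7, NA)   # stands in for the same column of table2

by.pmin     <- pmin(x, y, na.rm = TRUE)  # smaller value wins where both exist
by.coalesce <- fcoalesce(x, y)           # x wins wherever x is not NA

by.pmin      # 2 7 3
by.coalesce  # 5 7 3
```

On the example tables the two approaches should give the same result, because every id present in both tables has matching non-NA values in the shared columns.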

eddi answered Oct 15 '22 23:10
