R: Efficient Way to Merge+Update Table With Second Table Where Values from Same Column Names Fill NAs

Tags: merge, join, r, data.table

Summary: I would like to merge two tables on a shared id key with all=TRUE (a full outer join). Instead of same-named columns coming out as var1.x, var1.y, etc., each pair should be merged into a single column in which missing (NA) values from the left table are filled in by values from the right table. This is in addition to the standard behavior of merge, i.e., appending rows with distinct ids and columns with distinct names.

Details:

I would like to merge + update table1 with table2 based on a shared id key column such that:

1) If table1 and table2 have columns with the same name (other than id), the value in table1 is left alone if it exists and replaced by the value in table2 if the value in table1 is NA.

2) If table2 has columns that table1 does not have (different names), they are merged (by id).

3) If table1 has an id that does not match in table2, the values for the columns that exist only in table2 are NA.

4) If table2 has an id that does not match in table1, it is added as a new row and the values for the columns that exist only in table1 are NA.

Rules 3 and 4 are the same as a standard merge with all=TRUE.
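Rule 1 taken in isolation is an element-wise fill. A minimal sketch on two bare vectors, assumed to be already aligned by id (the names x, y and filled are invented for the illustration):

```r
# x stands in for a table1 column, y for the same-named table2 column;
# both are assumed to be aligned row by row on id
x <- c(1, NA, 3, NA)
y <- c(9, 2, NA, NA)

# keep x where it is present, otherwise take y
filled <- ifelse(is.na(x), y, x)
print(filled)
```

Positions where both sides are NA stay NA.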

I'm concerned that I have overthought the problem, as I cannot find a straightforward way to do this with a merge or join that doesn't involve writing an ifelse check for every column. The real data has ~1000 columns, so a solution with one ifelse lookup per column would be incredibly long.

Reproducible reduced example:

library(data.table)

table1  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id6"),
                      var1=c(1, 2, 3, 4, 5, 6),
                      var2=c("a", "b", NA, "d", "e", "f"),
                      var3=c(NA, 12, 13, 14, 15, 16))

table2  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id8"),
                      var1=c(1, 2, NA, 4, 5, 8),
                      var2=c("a", "b", "c", "d", "e", "h"),
                      var4=c("foo", "bar", "oof", "rab", NA, "sna"))

desired <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id6", "id8"),
                      var1=c(1, 2, 3, 4, 5, 6, 8),
                      var2=c("a", "b", "c", "d", "e", "f", "h"),
                      var3=c(NA, 12, 13, 14, 15, 16, NA),
                      var4=c("foo", "bar", "oof", "rab", NA, NA, "sna"))

table1;
    id var1 var2 var3
1: id1    1    a   NA
2: id2    2    b   12
3: id3    3   NA   13
4: id4    4    d   14
5: id5    5    e   15
6: id6    6    f   16

table2;
    id var1 var2 var4
1: id1    1    a  foo
2: id2    2    b  bar
3: id3   NA    c  oof
4: id4    4    d  rab
5: id5    5    e   NA
6: id8    8    h  sna

desired
    id var1 var2 var3 var4
1: id1    1    a   NA  foo
2: id2    2    b   12  bar
3: id3    3    c   13  oof
4: id4    4    d   14  rab
5: id5    5    e   15   NA
6: id6    6    f   16   NA
7: id8    8    h   NA  sna

Explanation of desired output:

For column var1, table1 had all the values, so it is left alone and the NA for id3 in table2 is ignored (note that this doesn't include the row merge for different ids described below).
For column var2, table1 was missing the value indexed by id3, so it is updated from table2 (note that this doesn't include the row merge for different ids described below).
For column var3, there is no matching column in table2, so it is kept as is.
For column var4, there was no column var4 in table1, so it is merged from table2 by id key variable.
For row with id6 in table1, there is no matching id6 in table2, so the value for column var4 that is only in table2 is NA in the desired output for row id6.
For row with id8 in table2 there is no matching id8 in table1, so the value for column var3 that is only in table1 is NA in the desired output for row id8.

Surely there is a straightforward way to do this with data.table? Efficient solutions are particularly welcome given the size of the real data. The datamerge package apparently used to do exactly this, but it isn't on CRAN anymore and I can't get it to work on R 3.2.3 from the zip. Has another package stepped up for this task? There are many other threads that focus on solving this for one or a couple of columns with known names, but those approaches don't seem practical for a large number of columns.

asked Apr 01 '16 19:04

Mekki MacAulay

2 Answers

Here's one way:

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
com.cols.x  = paste0(com.cols, ".x")
com.cols.y  = paste0(com.cols, ".y")

# create combined table
DT = setkey(merge(table1, table2, by="id", all=TRUE), NULL)

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  DT[is.na(get(com.cols.x[j])), (com.cols.x[j]) := get(com.cols.y[j])]

# remove unneeded columns
DT[, (com.cols.y) := NULL]

# rename kept columns
setnames(DT, com.cols.x, com.cols)

identical(DT, desired) # TRUE

It's rather messy to create and work with all these column-names vectors.
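If the name bookkeeping is a nuisance, the same steps can be wrapped in a small function. This is only a sketch: merge_fill is a made-up name (not part of data.table), and it assumes a data.table version that provides fcoalesce(), shared columns of the same type in both tables, and no existing column names ending in .x or .y. The sample tables use the values printed in the question.

```r
library(data.table)

# merge_fill() is a hypothetical helper, not a data.table function
merge_fill <- function(x, y, by = "id") {
  common <- setdiff(intersect(names(x), names(y)), by)
  out <- merge(x, y, by = by, all = TRUE)   # full outer join; shared columns get .x/.y
  cx <- paste0(common, ".x")
  cy <- paste0(common, ".y")
  for (k in seq_along(common)) {
    # keep the left value where present, otherwise take the right one
    set(out, j = cx[k], value = fcoalesce(out[[cx[k]]], out[[cy[k]]]))
  }
  if (length(common)) {
    out[, (cy) := NULL]          # drop the right-hand copies
    setnames(out, cx, common)    # restore the original names
  }
  setkey(out, NULL)              # remove the key that merge() sets
  out[]
}

table1 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id6"),
                     var1 = c(1, 2, 3, 4, 5, 6),
                     var2 = c("a", "b", NA, "d", "e", "f"),
                     var3 = c(NA, 12, 13, 14, 15, 16))
table2 <- data.table(id   = c("id1", "id2", "id3", "id4", "id5", "id8"),
                     var1 = c(1, 2, NA, 4, 5, 8),
                     var2 = c("a", "b", "c", "d", "e", "h"),
                     var4 = c("foo", "bar", "oof", "rab", NA, "sna"))

res <- merge_fill(table1, table2)
print(res)
```

Unlike the by-reference variants, this returns a new table and leaves both inputs untouched.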

Regarding the original question, here's another way (without importing new rows from table2, as in the original post, so a row such as id8 is not added):

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols  = paste0("i.", com.cols)
new.cols    = c(i.com.cols, setdiff(names(table2), c("id", com.cols)))

# grab columns from table2
table1[table2, (new.cols) := mget(new.cols), on="id"]

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  table1[is.na(get(com.cols[j])), (com.cols[j]) := get(i.com.cols[j])]

# remove unneeded columns
table1[, (i.com.cols) := NULL]

This way all steps are modifications of table1 by reference.
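Modifying by reference also means that any other name bound to the same table sees the change. A small sketch of the consequence, with toy tables made up for the illustration (t1, t2, alias and backup are not from the question); copy() is the usual way to keep an untouched version:

```r
library(data.table)

# toy tables, made up for this illustration
t1 <- data.table(id = c("id1", "id2"), var2 = c("a", NA))
t2 <- data.table(id = c("id1", "id2"), var2 = c("a", "b"))

alias  <- t1        # plain assignment: a second name for the same table
backup <- copy(t1)  # independent copy taken before the update

# update join: fill NA in t1$var2 from t2, modifying t1 in place
t1[t2, var2 := ifelse(is.na(var2), i.var2, var2), on = "id"]

print(alias$var2)   # follows t1
print(backup$var2)  # still holds the NA
```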

answered Oct 16 '22 00:10

Frank

Here's another option that avoids explicitly adding the i. columns to the original table:

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols  = paste0("i.", com.cols)
# I'm using the same var names as Frank, but new.cols is strictly the new ones here
new.cols    = setdiff(names(table2), names(table1))

# this is easy - the previously absent cols
table1[table2, (new.cols) := mget(new.cols), on = 'id']

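# now for the ones that need updating
# (pmin with na.rm = TRUE returns the non-NA side; if both sides are non-NA
#  and differ, it returns the smaller one, which may not be table1's value)
table1[table2, on = 'id',
       (com.cols) := Map(function(col, i.col) pmin(col, i.col, na.rm = TRUE),
                         mget(com.cols), mget(i.com.cols))]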

I have no idea which option is faster - OP can check that.
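One caveat with pmin: when both tables hold a non-NA value and the two differ, it keeps the smaller one rather than table1's. If table1's value must always win, a possible variation is to line the rows up with match() and fill with fcoalesce(). This is a sketch on toy tables made up for the illustration, assuming ids are unique in the second table and a data.table version that provides fcoalesce():

```r
library(data.table)

# toy tables, made up for this illustration; the two var1 values for id1 disagree
t1 <- data.table(id = c("id1", "id2"), var1 = c(5, NA))
t2 <- data.table(id = c("id1", "id2"), var1 = c(1, 2))

com.cols <- setdiff(intersect(names(t1), names(t2)), "id")

# row of t2 that matches each row of t1 (NA where the id is absent from t2)
idx <- match(t1$id, t2$id)

# overwrite each shared column by reference, preferring t1's own value
for (col in com.cols)
  set(t1, j = col, value = fcoalesce(t1[[col]], t2[[col]][idx]))

print(t1$var1)
```

Here id1 keeps 5 from t1, whereas pmin would have picked 1 from t2.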

answered Oct 15 '22 23:10

eddi
