Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to dynamically use lapply in data.table?

I have a dataset that looks as

set.seed(18)
library(data.table)
site1 <- data.table(id = 1:10, A = c(sample(c(NA, letters[1:10]),10)), 
                    B = sample(c(NA, LETTERS[1:7]), 10, replace = T),
                    C = sample(c(NA, 1:4), 10, replace = T))

site2 <- data.table(id = c(1:4, sample(5:15, 6)), 
                    A = c(NA, NA, NA, sample(letters, 1), NA, NA, NA, sample(letters, 1), NA, NA), 
                    B = sample(LETTERS, 10), d = sample(1:5, replace = T))

and a function that looks as

col.smash <- function(a, b, linkvars){
  require(data.table)
  
  ##### CONVERT TO DATA.TABLES FOR EASIER USE, AND MERGE
  if(dim(a)[1] <= dim(b)[1]){
    c <- data.table(a); setkeyv(c, linkvars)
    d <- data.table(b); setkeyv(d, linkvars)
  } else {
    c <- data.table(b); setkeyv(c, linkvars)
    d <- data.table(a); setkeyv(d, linkvars)
  }
 
  k <- c[d]
  
  rep.list<- names(a)[names(a) %in% names(b) & !(names(a) %in% linkvars)]
  i.combo <- paste0("i.",rep.list)

  f <- k[ , (rep.list) := lapply(.SD, function(x){ifelse(is.na(x), 
                                                   get("i.", names(x)), x)}), 
          .SDcols = rep.list]
  return(f)
  }

This function's goal is to see what variables are in both site1 and site2 and if there is an "NA" in, lets say site1$A, replace it with the corresponding value in site2$A. There is a hierarchy of site1 over site2, which is why the ifelse statement only checks for one variable with "NA".

I am getting an error at the lapply function because the first ifelse result (get("i.",names(x))), after the condition isn't working properly. In doing so, I get the following error:

Error in as.environment(pos) : using 'as.environment(NULL)' is defunct

which I do not understand. Ideally, what I would get is a data.table with all values in site1 and site2 with variables A, B, C, D and not with i.A, i.B as such,

    id  A  B  C  d
 1:  1  i  E NA  4
 2:  2  g  F NA  4
 3:  3  h NA  4  1
 4:  4  x  B  4  2
 5:  5  j  G NA  NA
 6:  6  c NA  3  4
 7:  7  a  D  2  NA
 8:  8  b NA  2  NA
 9:  9  d  G  1  4
10: 10  f NA  1  NA
11: 12 NA  V NA  2
12: 13  n  J NA  1
13: 14 NA  T NA  1
14: 15 NA  X NA  1

So I think I really have two problems. The first is the error, and the second is that I am not getting all rows in k in my function. They don't seem to be related.

Any help is appreciated.

Also, brownie points for whoever can figure out the incredible col.smash reference.

like image 207
akash87 Avatar asked Feb 17 '26 01:02

akash87


1 Answers

This function's goal is to see what variables are in both site1 and site2 and if there is an "NA" in, lets say site1$A, replace it with the corresponding value in site2$A. There is a hierarchy of site1 over site2

The output can be had like

g <- function(d1, d2, byvars){
  D = funion(d1[, ..byvars], d2[, ..byvars])

  d2vars = setdiff(names(d2), byvars)
  D[d2, on=byvars, (d2vars) := mget(sprintf("i.%s", d2vars))]

  d1vars = setdiff(names(d1), byvars)
  D[d1, on=byvars, (d1vars) := mget(sprintf("i.%s", d1vars))]  

  setcolorder(D, c(byvars, d1vars, setdiff(d2vars, d1vars)))
  setorderv(D, byvars)[]
}

g(site1, site2, "id")

which gives

    id  A  B  C  d
 1:  1  i  E NA  4
 2:  2  g  F NA  4
 3:  3  h NA  4  1
 4:  4 NA  B  4  2
 5:  5  j  G NA NA
 6:  6  c NA  3  4
 7:  7  a  D  2 NA
 8:  8  b NA  2 NA
 9:  9  d  G  1  4
10: 10  f NA  1 NA
11: 12 NA  V NA  2
12: 13  n  J NA  1
13: 14 NA  T NA  1
14: 15 NA  X NA  1

How it works

The byvars argument allows for a vector of column names.

The fairly new .. syntax allows referencing an index of columns stored outside the data.table. I looked in the FAQ and ?data.table and could find no documentation. For now, it is the first changelog item in 1.10.2 at least.

To give the "hierarchy of site1 over site2", we add from site2 first and then site1, so it gets the last edit.

The use of funion assumes that there are no duplicates within each table. If there are, a more complicated approach to that step will be needed, probably something like

D = rbind(d1[, ..byvars], d2[,..byvars][!d1, on=byvars])
like image 167
Frank Avatar answered Feb 18 '26 15:02

Frank