cbind specific columns from multiple data.tables efficiently

Question

I have a list of data.tables that I need to cbind; however, I only need the last X columns of each.

My data is structured as follows:

DT.1 <- data.table(x=c(1,1), y = c("a","a"), v1 = c(1,2), v2 = c(3,4))
DT.2 <- data.table(x=c(1,1), y = c("a","a"), v3 = c(5,6))
DT.3 <- data.table(x=c(1,1), y = c("a","a"), v4 = c(7,8), v5 = c(9,10), v6 = c(11,12))

DT.list <- list(DT.1, DT.2, DT.3)
> DT.list
[[1]]
   x y v1 v2
1: 1 a  1  3
2: 1 a  2  4

[[2]]
   x y v3
1: 1 a  5
2: 1 a  6

[[3]]
   x y v4 v5 v6
1: 1 a  7  9 11
2: 1 a  8 10 12

Columns x and y are the same in each of the data.tables, but the number of other columns differs. The output should not include duplicate x and y columns. It should look as follows:

   x y v1 v2 v3 v4 v5 v6
1: 1 a  1  3  5  7  9 11
2: 1 a  2  4  6  8 10 12

I want to avoid using a loop. I am able to bind the data.tables using do.call("cbind", DT.list) and then remove the duplicates manually, but is there a way to avoid creating the duplicates in the first place? Efficiency also matters, since the lists can be long and the data.tables large.

thanks

Frank · Accepted Answer

Here's another way:

Reduce(
  function(x,y){
    newcols = setdiff(names(y),names(x))
    x[,(newcols)] <- y[, ..newcols]
    x
  }, 
  DT.list,
  init = copy(DT.list[[1]][,c("x","y")])
)
#    x y v1 v2 v3 v4 v5 v6
# 1: 1 a  1  3  5  7  9 11
# 2: 1 a  2  4  6  8 10 12

This avoids modifying the list (as @bgoldst's <- NULL assignment does) or making copies of every element of the list (as, I think, the lapply approach does). I would probably do the <- NULL thing in most practical applications, though.
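For comparison, here is a rough sketch of the same idea written with data.table's set(), which adds columns to the result by reference. It is not part of the original answer; the names res and newcols are only illustrative, and it assumes that every table has the same rows in the same order.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

# Start from a copy of the first table so DT.list[[1]] itself is not modified.
res <- copy(DT.list[[1]])

for (DT in DT.list[-1]) {
  # Columns of this table that the result does not have yet.
  newcols <- setdiff(names(DT), names(res))
  if (length(newcols) > 0L) {
    # Extract only the new columns and add them to res in place.
    set(res, j = newcols, value = DT[, newcols, with = FALSE])
  }
}

res
```

This does use an explicit loop, but only the new columns of each table are extracted, and the elements of DT.list keep their x and y columns.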

bgoldst · Answer

Here's how it could be done in one shot, using lapply() to remove columns x and y from second-and-subsequent data.tables before calling cbind():

do.call(cbind,c(DT.list[1],lapply(DT.list[2:length(DT.list)],`[`,j=-c(1,2))));
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12
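If the shared columns are not guaranteed to sit in positions 1 and 2, a variant that drops them by name may be safer. This is only a sketch: key_cols, rest and res are names introduced here for illustration, and it uses DT.list[-1] instead of 2:length(DT.list) so that a one-element list does not produce an invalid index.

```r
library(data.table)

DT.1 <- data.table(x = c(1, 1), y = c("a", "a"), v1 = c(1, 2), v2 = c(3, 4))
DT.2 <- data.table(x = c(1, 1), y = c("a", "a"), v3 = c(5, 6))
DT.3 <- data.table(x = c(1, 1), y = c("a", "a"), v4 = c(7, 8), v5 = c(9, 10), v6 = c(11, 12))
DT.list <- list(DT.1, DT.2, DT.3)

# Columns shared by every table (assumed to be known in advance).
key_cols <- c("x", "y")

# From the second table onward, keep only the columns that are not shared.
rest <- lapply(DT.list[-1], function(d) {
  d[, setdiff(names(d), key_cols), with = FALSE]
})

# Bind the first table, untouched, with the trimmed remainder.
res <- do.call(cbind, c(DT.list[1], rest))
res
```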

Another approach is to remove columns x and y from second-and-subsequent data.tables before doing a straight cbind(). I think there's nothing wrong with using a for loop for this:

for (i in seq_along(DT.list)[-1]) DT.list[[i]][,c('x','y')] <- NULL;
DT.list;
## [[1]]
##    x y v1 v2
## 1: 1 a  1  3
## 2: 1 a  2  4
##
## [[2]]
##    v3
## 1:  5
## 2:  6
##
## [[3]]
##    v4 v5 v6
## 1:  7  9 11
## 2:  8 10 12
##
do.call(cbind,DT.list);
##    x y v1 v2 v3 v4 v5 v6
## 1: 1 a  1  3  5  7  9 11
## 2: 1 a  2  4  6  8 10 12

Tags: r, data.table, cbind

greyBag
