Allow .SDcols to vary with grouping variable in data.table

Tags: r, data.table

Is it possible to have .SDcols vary with the by grouping variable? I would like .SDcols to select different columns for each year. The column names for each year are stored in one data.table, and I am trying to apply a function to the .SD of another table using those names.

Quite likely I am missing the obvious approach and doing this wrong, but this is what I attempted:

## Contains the .SDcols applicable to each year
dat1 <- data.table(
  year = 1:4,
  vals = lapply(1:4, function(i) letters[1:i])
)

## Make the sample data (with NAs)
set.seed(1775)
dat2 <- data.table( year = sample(1:4, 10, TRUE) )
dat2[, letters[1:4] := replicate(4, sample(c(NA, 1:5), 10, TRUE), simplify=FALSE)]

## Goal: Sum up the columns in the corresponding .SDcols for each year
## Attempt, doesn't work -- I think b/c .SDcols must be fixed?
dat2[, SUM := rowSums(.SD, na.rm=TRUE), by=year, 
  .SDcols=unlist(dat1[year == .BY[[1]], vals])]

## Desired result, by simply iterating through each possible year
for (i in 1:4) {
  dat2[year==i, SUM := rowSums(.SD, na.rm=TRUE), 
    .SDcols=unlist(dat1[year == i, vals])]
}

dat2[]
#     year  a  b c  d SUM
#  1:    1  3  1 5  1   3
#  2:    2  1  3 3  1   4
#  3:    1  5  4 3 NA   5
#  4:    4  1 NA 4  5  10
#  5:    2  2  2 2 NA   4
#  6:    2 NA  3 3 NA   3
#  7:    4  2  3 2 NA   7
#  8:    1  2 NA 5  4   2
#  9:    2  3  3 5  1   6
# 10:    3 NA  4 2 NA   6
Rorschach asked Feb 19 '16 08:02


1 Answer

It seems to me that you are looking for a simple join that updates dat2 by reference, with j evaluated once for each row of dat1 (by = .EACHI). Either way, rowSums is the bottleneck in both this solution and your attempt, because it converts .SD to a matrix. If I were you, I would convert all the NAs to zeroes and run Reduce(`+`, ...) instead (though I am not sure whether you want to change the values in your original data).

dat2[dat1, 
      SUM := rowSums(.SD[, unlist(i.vals), with = FALSE], na.rm = TRUE), 
      on = "year", 
     by = .EACHI]
dat2
#     year  a  b c  d SUM
#  1:    1  3  1 5  1   3
#  2:    2  1  3 3  1   4
#  3:    1  5  4 3 NA   5
#  4:    4  1 NA 4  5  10
#  5:    2  2  2 2 NA   4
#  6:    2 NA  3 3 NA   3
#  7:    4  2  3 2 NA   7
#  8:    1  2 NA 5  4   2
#  9:    2  3  3 5  1   6
# 10:    3 NA  4 2 NA   6
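
To make the role of by = .EACHI more concrete, here is a minimal sketch that only inspects what each group sees instead of updating anything. It assumes the year column printed above (typed in by hand rather than re-sampled, since sample() may give different draws on other R versions); the result names n_rows and cols are made up for illustration.

```r
library(data.table)

# Lookup table from the question: columns to use for each year
dat1 <- data.table(
  year = 1:4,
  vals = lapply(1:4, function(i) letters[1:i])
)

# Year column copied from the printed output above (assumption: not re-sampled)
dat2 <- data.table(year = c(1L, 2L, 1L, 4L, 2L, 2L, 4L, 1L, 2L, 3L))

# With by = .EACHI, j is evaluated once per row of dat1.
# i.vals is that row's list cell, so unlist(i.vals) gives the column names.
res <- dat2[dat1,
            .(n_rows = .N,
              cols   = paste(unlist(i.vals), collapse = "+")),
            on = "year",
            by = .EACHI]
res
# n_rows is 3, 4, 1, 2 for years 1 to 4
# cols   is "a", "a+b", "a+b+c", "a+b+c+d"
```

Each row of dat1 defines one group, which is why a different set of columns can be picked per year inside j even though .SDcols itself cannot vary.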

Though, as mentioned, if I were you I would convert the NAs to zeroes (note that this modifies dat2 by reference) and use Reduce instead:

for(j in 2:ncol(dat2)) set(dat2, i = which(is.na(dat2[[j]])), j = j, value = 0L)
dat2[dat1,
       SUM := Reduce(`+`, .SD[, unlist(i.vals), with = FALSE]), 
       on = "year", 
    by = .EACHI]
dat2
#     year a b c d SUM
#  1:    1 3 1 5 1   3
#  2:    2 1 3 3 1   4
#  3:    1 5 4 3 0   5
#  4:    4 1 0 4 5  10
#  5:    2 2 2 2 0   4
#  6:    2 0 3 3 0   3
#  7:    4 2 3 2 0   7
#  8:    1 2 0 5 4   2
#  9:    2 3 3 5 1   6
# 10:    3 0 4 2 0   6
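
If overwriting the NAs in dat2 is not acceptable, one possible variant is to treat NA as zero only while summing. This is a rough sketch rather than a benchmarked solution: sum_na0 is a hypothetical helper, and the four rows are copied by hand from the output printed in the question.

```r
library(data.table)

# Lookup table from the question: columns to sum for each year
dat1 <- data.table(
  year = 1:4,
  vals = lapply(1:4, function(i) letters[1:i])
)

# Four rows copied from the question's printed output (rows 1, 6, 4 and 10)
dat2 <- data.table(
  year = c(1L, 2L, 4L, 3L),
  a    = c(3L, NA, 1L, NA),
  b    = c(1L, 3L, NA, 4L),
  c    = c(5L, 3L, 4L, 2L),
  d    = c(1L, NA, 5L, NA)
)

# Hypothetical helper: column-wise running sum that counts NA as 0.
# The initial value 0L also covers the case of a single column.
sum_na0 <- function(cols) {
  Reduce(function(acc, x) acc + replace(x, is.na(x), 0L), cols, 0L)
}

# Same join as above, but the original NAs in dat2 are left in place
dat2[dat1,
     SUM := sum_na0(.SD[, unlist(i.vals), with = FALSE]),
     on = "year",
     by = .EACHI]
dat2$SUM
# 3 3 10 6, matching the SUM values of those rows in the question
```

The trade-off is an extra replace() per column inside each group, so whether this beats rowSums on real data would need to be measured.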
David Arenburg answered Sep 25 '22 13:09