There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:
# Create a data table with 26 columns. Variable names are var1, ..., var 26
data.mat = matrix(sample(letters, 26*26, replace=TRUE),ncol=26)
colnames(data.mat) = paste("var",1:26,sep="")
data.dt <- data.table(data.mat)
Now, say I would like to count the number of 'a's in columns 5,6,7 and 8. I cannot see how to do this with SDcols and end up doing:
data.dt[,numberOfAs := (var5=='a')+(var6=='a')+(var7=='a')+(var7=='a')]
Which is very tedious. Is there a more sensible way to do this?
Thanks
I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD
and .SDcols
.
.SD
is just a data.table containing the data for current group. And .SDcols
tells the columns .SD
should have. A useful way is to use print
to see the content.
# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]
Since there is no by
here, .SD
contains all the rows of data.dt
, corresponding to the columns specified in .SDcols
.
Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]
We return a logical matrix by comparing all the columns in .SD
to "a". And then use rowSums
to sum them up.
Another way using Reduce
:
data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With