Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using data.table to calculate a function which depends on many columns

Tags:

r

data.table

There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:

# Create a data table with 26 columns.  Variable names are var1, ..., var 26
data.mat = matrix(sample(letters, 26*26, replace=TRUE),ncol=26)
colnames(data.mat) = paste("var",1:26,sep="")
data.dt <- data.table(data.mat)

Now, say I would like to count the number of 'a's in columns 5,6,7 and 8. I cannot see how to do this with SDcols and end up doing:

data.dt[,numberOfAs := (var5=='a')+(var6=='a')+(var7=='a')+(var7=='a')]

Which is very tedious. Is there a more sensible way to do this?

Thanks

like image 947
Alan Chalk Avatar asked Jan 09 '23 00:01

Alan Chalk


1 Answers

I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD and .SDcols.

.SD is just a data.table containing the data for current group. And .SDcols tells the columns .SD should have. A useful way is to use print to see the content.

# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]

Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.


Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.

data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]

We return a logical matrix by comparing all the columns in .SD to "a". And then use rowSums to sum them up.

Another way using Reduce:

data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
like image 149
Arun Avatar answered Jan 11 '23 17:01

Arun