using mean with .SD and .SDcols in data.table

Tags: r, data.table, mean

I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.

So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:

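library(data.table)

# c, e, f and h are shorter than 10 and get recycled to fill the 10 rows
dt <- data.table(
  a = 1:10,
  b = as.factor(letters[1:10]),
  c = c(TRUE, FALSE),
  d = runif(10, 0.5, 100),
  e = c(0, 1),
  f = as.integer(c(0, 1)),
  g = as.numeric(1:10),
  h = c("cat1", "cat2", "cat3", "cat4", "cat5"))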

mean(dt$a)
[1] 5.5

dt[, mean(.SD), .SDcols = "a"]

[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA

dt[, sum(.SD), .SDcols = "a"]
[1] 55

dt[, max(.SD), .SDcols = "a"]
[1] 10

dt[, colMeans(.SD), .SDcols = "a"]
  a 
5.5 

dt[, lapply(.SD, mean), .SDcols = "a"]
     a
1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).

I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.

Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).

My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.

Thanks.

Mark Danese asked Apr 10 '15 18:04


1 Answer

I think the appropriate way of approaching what you want is to just use the standard syntax:

dt[ , lapply(.SD, mean), .SDcols = "a"]
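
As a hedged illustration of why the lapply form behaves differently from mean(.SD): .SD is itself a data.table, that is, a list of columns, and base R's mean() has no method for data frames. The sketch below uses a small hypothetical table named demo (not the dt from the question) and copies .SD out of j so that it can be inspected directly.

```r
library(data.table)

# Hypothetical table used only for this illustration
demo <- data.table(a = 1:10, d = seq(0.5, 5, by = 0.5))

# Copy of what j sees as .SD when .SDcols = "a": a one-column data.table
sd_copy <- demo[, .SD, .SDcols = "a"]
sd_class <- class(sd_copy)

# mean.default() receives a list rather than a numeric vector,
# so it warns and returns NA
res_bad <- suppressWarnings(mean(sd_copy))

# sum() works because the Summary group generic has a data.frame method
res_sum <- sum(sd_copy)

# weighted.mean() without weights computes sum(x) / length(x);
# length() of a one-column table is 1, hence the sum instead of the mean
res_wm <- weighted.mean(sd_copy)

# Extracting the column as a vector makes mean() work without lapply
res_vec <- demo[, mean(.SD[[1]]), .SDcols = "a"]
```

With a single column, .SD[[1]] pulls out the underlying vector, so plain functions such as mean() can be applied without lapply; the result is then a vector rather than a data.table.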

Alternatively, you can pass a variable by name as follows:

col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
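
Since the question is about a function that receives one column at a time, the get() approach could be wrapped roughly as follows. This is only a sketch: the name mean_of_column and the table demo are hypothetical, and the type check is an assumption about what the "diagnostics" step might look like.

```r
library(data.table)

# Hypothetical helper: returns the mean of one column chosen by name,
# or NA for columns that are neither numeric nor logical
mean_of_column <- function(dt, col) {
  stopifnot(is.data.table(dt), length(col) == 1L, col %in% names(dt))
  dt[, {
    x <- get(col)  # resolve the string in `col` to the column vector
    if (is.numeric(x) || is.logical(x)) mean(x) else NA_real_
  }]
}

# Hypothetical data for the usage example
demo <- data.table(a = 1:10, h = rep(c("cat1", "cat2"), 5))
res_a <- mean_of_column(demo, "a")
res_h <- mean_of_column(demo, "h")
```

Here res_a holds the mean of column a, while the character column h yields NA instead of a warning.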

Finally, you can generalize this approach to multiple columns as follows:

col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
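
For comparison, the mget() form and the .SDcols form should give the same one-row result for the same set of columns. A minimal sketch, assuming a hypothetical table demo and a hypothetical variable cols:

```r
library(data.table)

# Hypothetical table and column selection, used only for illustration
demo <- data.table(a = 1:10,
                   d = seq(0.5, 5, by = 0.5),
                   h = rep(c("x", "y"), 5))
cols <- c("a", "d")

# mget() returns a named list of the requested columns
via_mget <- demo[, lapply(mget(cols), mean)]

# .SDcols restricts .SD to the same columns
via_sd <- demo[, lapply(.SD, mean), .SDcols = cols]
```

Both calls return a one-row data.table with columns a and d, so the choice between them is mostly a matter of style.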
Francesco Grossetti answered Nov 15 '22 23:11