
data.table: computing several columns at once

Tags: r, data.table

Thank you in advance for reading this. I have a function that worked fine on data.table 1.9.3, but after updating the package today (to 1.9.4) it no longer works.

Here is my function and working example on data.table 1.9.3:

trait.by <- function(data,traits="",cross.by){
  traits = intersect(traits,names(data))
  if(length(traits)<1){  
    #if there is no intersect between names and traits
    return(      data[,       list(N. = .N),    by=cross.by])
  }else{
    return(data[,c(   N. = .N,
                    MEAN = lapply(.SD,function(x){return(round(mean(x,na.rm=T),digits=1))}) , 
                    SD   = lapply(.SD,function(x){return(round(sd  (x,na.rm=T),digits=2))}) ,
                    'NA' = lapply(.SD,function(x){return(sum  (is.na(x)))})),
                 by=cross.by, .SDcols = traits])
  }
}

> trait.by(data.table(iris),traits = c("Sepal.Length",    "Sepal.Width"),cross.by="Species")
#      Species N. MEAN.Sepal.Length MEAN.Sepal.Width SD.Sepal.Length
#1:     setosa 50               5.0              3.4            0.35
#2: versicolor 50               5.9              2.8            0.52
#3:  virginica 50               6.6              3.0            0.64
#   SD.Sepal.Width NA.Sepal.Length NA.Sepal.Width
#1:           0.38               0              0
#2:           0.31               0              0
#3:           0.32               0              0

The point is that MEAN.<trait>, SD.<trait> and NA.<trait> are computed for every column I pass in the traits argument.


When I run this with data.table 1.9.4 I receive the following error:

> trait.by(data.table(iris),traits = c("Sepal.Length",    "Sepal.Width"),cross.by="Species")
#Error in assign("..FUN", eval(fun, SDenv, SDenv), SDenv) : 
#  cannot change value of locked binding for '..FUN'

Any idea how I should fix this?

Mahdi Jadaliha asked Dec 15 '14 23:12




2 Answers

Update: This has been fixed now in 1.9.5 in commit 1680. From NEWS:

  1. Fixed a bug in the internal optimisation of j-expression with more than one lapply(.SD, function(..) ..) as illustrated here on SO. Closes #985. Thanks to @jadaliha for the report and to @BrodieG for the debugging on SO.

Now this works as expected:

data[,
  c(
    MEAN = lapply(.SD,function(x){return(round(mean(x,na.rm=T),digits=1))}),
    SD = lapply(.SD,function(x){return(round(sd  (x,na.rm=T),digits=2))})
  ), by=cross.by, .SDcols = traits]    
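Since the fix depends on the installed version, a quick check along these lines may help before relying on the c( form. This is only a sketch: the variable names installed and has_fix are made up for illustration, and 1.9.5 is simply the version named in the NEWS entry quoted above.

```r
# Version of data.table currently installed in this R library.
installed <- packageVersion("data.table")

# TRUE when the installed version is 1.9.5 or later, i.e. at or after the
# version the NEWS entry names as containing the fix.
has_fix <- installed >= "1.9.5"
print(has_fix)
```

If has_fix is FALSE, the c( form with several lapply(.SD, ...) calls can be expected to hit the error from the question.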

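This looks like a bug triggered by using lapply(.SD, FUN) more than once inside c( in a single data.table call. You can work around it by replacing c( with .( (an alias for list( inside a data.table call). Note that this changes the shape of the result, as the output in the second answer shows.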

traits <- c("Sepal.Length",    "Sepal.Width")
cross.by <- "Species"
data <- data.table(iris)

data[,
  c(
    MEAN = lapply(.SD,function(x){return(round(mean(x,na.rm=T),digits=1))})
  ),
  by=cross.by, .SDcols = traits
]

Works.

data[,
  c(
    SD = lapply(.SD,function(x){return(round(sd  (x,na.rm=T),digits=2))})
  ),
  by=cross.by, .SDcols = traits
]

Works.

data[,
  c(
    MEAN = lapply(.SD,function(x){return(round(mean(x,na.rm=T),digits=1))}),
    SD = lapply(.SD,function(x){return(round(sd  (x,na.rm=T),digits=2))})
  ),
  by=cross.by, .SDcols = traits
]    

Doesn't work.

data[,
  .(
    MEAN = lapply(.SD,function(x){return(round(mean(x,na.rm=T),digits=1))}),
    SD = lapply(.SD,function(x){return(round(sd  (x,na.rm=T),digits=2))})
  ),
  by=cross.by, .SDcols = traits
]

Works.
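If the one-column-per-trait layout from the question is needed, another possible workaround is to call lapply(.SD, ...) only once and compute all statistics inside it. The sketch below assumes a current data.table release; col_stats is a hypothetical helper that is not part of the original post, and the resulting column names come out trait-first (Sepal.Length.MEAN) rather than statistic-first (MEAN.Sepal.Length).

```r
library(data.table)

# Hypothetical helper (not from the original post): returns all three
# statistics for one column as a named list.
col_stats <- function(x) {
  list(
    MEAN = round(mean(x, na.rm = TRUE), digits = 1),
    SD   = round(sd(x, na.rm = TRUE), digits = 2),
    "NA" = sum(is.na(x))
  )
}

traits   <- c("Sepal.Length", "Sepal.Width")
cross.by <- "Species"
data     <- data.table(iris)

# A single lapply() over .SD gives a nested list (trait -> statistics).
# unlist(recursive = FALSE) flattens it by one level, producing one entry
# per trait/statistic pair with names such as "Sepal.Length.MEAN".
wide <- data[,
  c(list(N. = .N), unlist(lapply(.SD, col_stats), recursive = FALSE)),
  by = cross.by, .SDcols = traits
]
print(wide)
```

This returns one row per group, as in the question, at the cost of a different column order and naming scheme.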

BrodieG answered Nov 07 '22 00:11


Like this? The output format changes slightly (one row per trait within each group instead of one column per trait), but all the results are there.

trait.by <- function(data,traits="",cross.by){
  traits = intersect(traits,names(data))
  if(length(traits)<1){  
    #if there is no intersect between names and traits
    return(data[, list(N. = .N), by=cross.by])
  }else{
    # ** Changes: use list() instead of c(), drop the redundant return()
    # inside the anonymous functions, and add a col_Nam column (see the comments below)
    return(data[, list(N. = .N,
                       MEAN = lapply(.SD,function(x){round(mean(x,na.rm=T),digits=1)}) , 
                       SD   = lapply(.SD,function(x){round(sd  (x,na.rm=T),digits=2)}) ,
                       'NA' = lapply(.SD,function(x){sum  (is.na(x))}),
                       col_Nam = names(.SD)),
                by=cross.by, .SDcols = traits])
  }
}
trait.by(data.table(iris),traits = c("Sepal.Length", "Sepal.Width"),cross.by="Species")

# result
      Species N. MEAN   SD NA      col_Nam
1:     setosa 50    5 0.35  0 Sepal.Length
2:     setosa 50  3.4 0.38  0  Sepal.Width
3: versicolor 50  5.9 0.52  0 Sepal.Length
4: versicolor 50  2.8 0.31  0  Sepal.Width
5:  virginica 50  6.6 0.64  0 Sepal.Length
6:  virginica 50    3 0.32  0  Sepal.Width
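One thing to be aware of with this layout: because each lapply(.SD, ...) result is placed inside list(, the statistic columns come back as list columns rather than plain numeric vectors, which is why 5.0 prints as 5 above. A minimal sketch of flattening one of them, using a reduced version of the call with only the MEAN column (the name long is made up for illustration):

```r
library(data.table)

dt <- data.table(iris)

# Reduced version of the call above: one statistic plus the trait name.
long <- dt[,
  list(MEAN    = lapply(.SD, function(x) round(mean(x, na.rm = TRUE), digits = 1)),
       col_Nam = names(.SD)),
  by = "Species", .SDcols = c("Sepal.Length", "Sepal.Width")
]

# MEAN is a list column here: one length-1 numeric per row.
print(is.list(long$MEAN))

# Replace the list column with an ordinary numeric column, by reference.
long[, MEAN := unlist(MEAN)]
print(is.numeric(long$MEAN))
```

The same unlist() step could be applied to the SD and NA columns of the full result.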

KFB answered Nov 07 '22 00:11