 

Row operations in data.table

Tags: r, data.table, mean

I'm trying to compute a simple sum and mean across rows using data.table, but I am getting unexpected results. I followed section 2 of the data.table FAQ. I found an approach that works, but I don't understand why the method from the FAQ does not. The following gives me an incorrect result (genesum simply repeats the value of the first data column):

dt[, genesum:=lapply(.SD,sum), by=gene]
head(dt)

      gene      TCGA_04_1348      TCGA_04_1362   genesum  
  1:    A1BG          0.94565          0.70585  0.94565   
  2: A1BG-AS          0.97610          1.15850  0.97610   
  3:    A1CF          0.00000          0.02105  0.00000   
  4:   A2BP1          0.00300          0.04150  0.00300   
  5:   A2LD1          4.57975          5.02820  4.57975  
  6:     A2M         60.37320         36.09715 60.37320 

This, on the other hand, gives me the desired result:

dt[, genesum:=apply(dt[,-1, with=FALSE],1, sum)]
head(dt)

       gene     TCGA_04_1348       TCGA_04_1362 genesum
  1:    A1BG          0.94565          0.70585  1.65150
  2: A1BG-AS          0.97610          1.15850  2.13460
  3:    A1CF          0.00000          0.02105  0.02105
  4:   A2BP1          0.00300          0.04150  0.04450
  5:   A2LD1          4.57975          5.02820  9.60795
  6:     A2M         60.37320         36.09715 96.47035

I have many more columns and rows; this is just a subset. Does the unexpected result have anything to do with the way I set the key?

tables()
     NAME   NROW MB COLS                                            KEY
[1,] dt   20,785  2 gene,TCGA_04_1348_01A,TCGA_04_1362_01A,genesum gene
asked Oct 01 '22 15:10 by sahir


1 Answer

A few things:

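  1. dt[, genesum:=lapply(.SD,sum), by=gene] and dt[, genesum:=apply(dt[ ,-1],1, sum)] do quite different things.

    • dt[, genesum:=lapply(.SD,sum), by=gene] loops over the columns of the .SD data.table within each gene group and sums each column separately. It produces one sum per column, not one sum per row. Each gene appears to occupy a single row here, so each column's "sum" is just that row's value, which is consistent with genesum matching the first data column in your output.

    • dt[, genesum:=apply(dt[, -1], 1, sum)] loops over the rows (i.e., apply(x, 1, function) applies function to every row of x).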

  2. I think you can get what you want by calling rowSums, like so:

    dt[, genesum := rowSums(dt[, -1])]
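Since the question asks for both a row sum and a row mean, here is a minimal sketch of one way to get both. It assumes that every column other than gene is a numeric sample column. The toy table and the names sample_cols and genemean are made up for illustration and are not from the original post. Restricting .SD with .SDcols, instead of indexing dt[, -1], is a swapped-in variant that avoids picking up a genesum column that already exists from an earlier attempt.

```r
library(data.table)

# Toy data copied from the first three rows shown in the question.
dt <- data.table(
  gene         = c("A1BG", "A1BG-AS", "A1CF"),
  TCGA_04_1348 = c(0.94565, 0.97610, 0.00000),
  TCGA_04_1362 = c(0.70585, 1.15850, 0.02105)
)

# Pick the sample columns once, before adding any result columns,
# so that genesum and genemean never feed into each other.
# (sample_cols is a hypothetical helper name.)
sample_cols <- setdiff(names(dt), "gene")

# .SD is restricted to sample_cols; rowSums/rowMeans work across each row.
dt[, genesum  := rowSums(.SD),  .SDcols = sample_cols]
dt[, genemean := rowMeans(.SD), .SDcols = sample_cols]

print(dt)
```

No by=gene is needed here, because rowSums and rowMeans already return one value per row.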
    

Is that what you're after?
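To make the column-versus-row distinction from point 1 concrete, here is a small base R sketch. It uses a plain data.frame, not a data.table, to keep it self-contained. The names df, col_sums and row_sums are illustrative only, and the values are taken from the first two rows of the question.

```r
# Two numeric columns, values taken from the question's first two rows.
df <- data.frame(
  TCGA_04_1348 = c(0.94565, 0.97610),
  TCGA_04_1362 = c(0.70585, 1.15850)
)

# lapply() walks over the columns: one sum per column.
col_sums <- lapply(df, sum)

# apply(x, 1, f) walks over the rows: one sum per row.
row_sums <- apply(df, 1, sum)

str(col_sums)    # a list with one element per column
print(row_sums)  # a numeric vector with one element per row
```

The lapply result has as many elements as there are columns, so it is not the right shape for a single per-row column such as genesum, whereas the apply result has exactly one value per row.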

answered Oct 05 '22 10:10 by Steve Lianoglou