
Use of lapply .SD in data.table R

Tags:

r

data.table

I am not clear on how .SD and by work together.

For instance, does the snippet below mean "change all the columns in DT to factor except A and B"? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)". So are columns A and B excluded?

DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)] 

However, I have also read that by works like GROUP BY in SQL when you aggregate. If I want to sum every column except A and B (a column sum, as in SQL), do I still use something similar? Or does the code below mean "take the sum, grouped by the values in columns A and B"?

DT[,lapply(.SD,sum),by=.(A,B)] 

If so, how do I do a simple column sum over all the columns except A and B?

KTY asked Aug 28 '15 17:08



1 Answer

Just to illustrate the comments above with an example, let's take

set.seed(10238)
# A and B are the "id" variables within which the
#   "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L),
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

#Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
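Applied to the exact case in the question (every column except A and B), the same pattern might look like the sketch below. It rebuilds the example DT so it runs on its own; the names totals and by_pair are only illustrative.

```r
library(data.table)

set.seed(10238)
DT <- data.table(
  A = rep(1:3, each = 5L),
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)

# Plain column sums over everything except A and B: no by, so one row.
# C and D are each a permutation of 1:15, so both totals are 120.
totals <- DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]

# With by = .(A, B), sum is applied once per distinct (A, B) pair and
# .SD leaves out A and B automatically. Every pair is unique in this DT,
# so the result has one row per original row.
by_pair <- DT[ , lapply(.SD, sum), by = .(A, B)]
```

In other words, by decides how rows are grouped, while .SDcols decides which columns the function is applied to; the two can be used separately or together.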

A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which avoids copying the whole table. See data.table's reference semantics vignette for more on this.

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums

sum is internally optimized in data.table (you can confirm this from the output you get by adding the verbose = TRUE argument within []). To see this in action, let's beef up your DT a bit and run a benchmark:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)

Results:

# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100
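
To make the earlier note about changing column types concrete, here is a minimal sketch of converting only the non-grouping columns to factor by reference. The small toy table and the helper names cols and res are assumptions for illustration, not part of the question.

```r
library(data.table)

DT <- data.table(
  A = 1:3,
  B = 4:6,
  C = c(10L, 20L, 30L),
  D = c(7L, 8L, 9L)
)

# Without := the call returns a new data.table; DT itself is untouched.
res <- DT[ , lapply(.SD, as.factor), by = .(A, B)]
classes_before <- sapply(DT, class)  # every column is still "integer"

# Update only C and D in place. The parentheses around cols tell
# data.table to use the names stored in cols as the columns to assign.
cols <- setdiff(names(DT), c("A", "B"))
DT[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]
classes_after <- sapply(DT, class)   # C and D are now "factor"
```

Restricting the update with .SDcols keeps A and B as they were, which is what the question was after, and no by is needed because the conversion is not a per-group operation.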
MichaelChirico answered Sep 25 '22 04:09