Use of lapply .SD in data.table R

Tags: r, data.table

I am not very clear about the use of .SD and by.

For instance, does the snippet below mean 'change all the columns in DT to factor except A and B'? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)", so are columns A and B excluded?

DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]

However, I also read that by works like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except A and B, do I still use something similar? Or, in this case, does the code below mean to take the sum grouped by the values in columns A and B (as with GROUP BY A, B in SQL)?

DT[,lapply(.SD,sum),by=.(A,B)]

Then how do I do a simple colsum over all the columns except A and B?

530

asked Aug 28 '15 17:08

KTY

1 Answer

Just to illustrate the comments above with an example, let's take

    library(data.table)
    set.seed(10238)
    # A and B are the "id" variables within which the
    #   "data" variables C and D vary meaningfully
    DT = data.table(
      A = rep(1:3, each = 5L),
      B = rep(1:5, 3L),
      C = sample(15L),
      D = sample(15L)
    )
    DT
    #     A B  C  D
    #  1: 1 1 14 11
    #  2: 1 2  3  8
    #  3: 1 3 15  1
    #  4: 1 4  1 14
    #  5: 1 5  5  9
    #  6: 2 1  7 13
    #  7: 2 2  2 12
    #  8: 2 3  8  6
    #  9: 2 4  9 15
    # 10: 2 5  4  3
    # 11: 3 1  6  5
    # 12: 3 2 12 10
    # 13: 3 3 10  4
    # 14: 3 4 13  7
    # 15: 3 5 11  2

Compare the following:

    # Sum all columns
    DT[ , lapply(.SD, sum)]
    #     A  B   C   D
    # 1: 30 45 120 120

    # Sum all columns EXCEPT A, grouping BY A
    DT[ , lapply(.SD, sum), by = A]
    #    A  B  C  D
    # 1: 1 15 38 43
    # 2: 2 15 30 49
    # 3: 3 15 52 28

    # Sum all columns EXCEPT A
    DT[ , lapply(.SD, sum), .SDcols = !"A"]
    #     B   C   D
    # 1: 45 120 120

    # Sum all columns EXCEPT A, grouping BY B
    DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
    #    B  C  D
    # 1: 1 27 29
    # 2: 2 17 30
    # 3: 3 33 11
    # 4: 4 23 36
    # 5: 5 20 14
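To tie this back to the question's own case (excluding both A and B), here is a minimal sketch. It assumes the same table as above, with the values typed in by hand so the result does not depend on the random number generator; the names `totals`, `grouped` and `sd_cols` are only illustrative.

```r
library(data.table)

# Same values as the example table printed above, entered by hand
DT = data.table(
  A = rep(1:3, each = 5L),
  B = rep(1:5, 3L),
  C = c(14L, 3L, 15L, 1L, 5L, 7L, 2L, 8L, 9L, 4L, 6L, 12L, 10L, 13L, 11L),
  D = c(11L, 8L, 1L, 14L, 9L, 13L, 12L, 6L, 15L, 3L, 5L, 10L, 4L, 7L, 2L)
)

# Plain column sums over everything except A and B: no `by`, only .SDcols
totals = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]

# Grouping by A and B: one row of sums per distinct (A, B) pair
grouped = DT[ , lapply(.SD, sum), by = .(A, B)]

# Which columns does .SD hold when grouping by A and B?
sd_cols = DT[ , .(col = names(.SD)), by = .(A, B)][ , unique(col)]
```

So `by = .(A, B)` is the SQL-style "group by", and the grouping columns are left out of .SD automatically, while `.SDcols = !c("A", "B")` without `by` gives the simple column sums. In this particular table every (A, B) pair occurs once, so `grouped` just reproduces C and D row by row.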

A few notes:

You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is an important point about data.table: the expression returns a new data.table, and all of the columns in DT are exactly as they were before running the code.
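A quick way to check this for yourself is sketched below. The three-column table and the names `before`, `after` and `res` are made up for illustration.

```r
library(data.table)

# Small hypothetical table
DT = data.table(A = 1:3, B = 4:6, C = c(2.5, 3.5, 4.5))

before = sapply(DT, class)
res = DT[ , lapply(.SD, as.factor)]   # the result is a separate data.table
after = sapply(DT, class)

identical(before, after)   # DT's column classes did not change
sapply(res, class)         # only the returned object holds factors
```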

You mentioned wanting to change the column types

As noted above, your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One way to make the change stick, which is what is usually done with data.frames in base R, is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This works but is wasteful, because it builds a full copy of DT, which can be an efficiency killer when DT is large. The idiomatic data.table approach is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which assigns the converted columns in place without copying the table. See data.table's reference semantics vignette for more on this.
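A minimal sketch of the by-reference update, using small made-up tables (`DT`, `DT2`) and illustrative variable names. `address()` is data.table's helper for inspecting an object's memory address, used here only to show that the table is not replaced by a copy.

```r
library(data.table)

# Small hypothetical table
DT = data.table(A = 1:3, B = 4:6, C = c(2.5, 3.5, 4.5))
addr_before = address(DT)

# Convert every column in place; no reassignment of DT is needed
DT[ , names(DT) := lapply(.SD, as.factor)]
addr_after = address(DT)

# Convert only selected columns: wrap the character vector in parentheses
# on the left-hand side and restrict .SD with .SDcols
DT2 = data.table(A = 1:3, B = 4:6, C = 7:9)
cols = c("B", "C")
DT2[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]
```

After this, `addr_before` and `addr_after` should match, every column of DT is a factor, and in DT2 only B and C were converted.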

You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can confirm this from the output printed when you add the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark:

    library(data.table)
    set.seed(12039)
    nn = 1e7; kk = seq(100L)
    DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
    DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

    library(microbenchmark)
    microbenchmark(
      times = 100L,
      colsums = colSums(DT[ , !c("A", "B")]),
      lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
    )

Results:

    # Unit: milliseconds
    #     expr       min        lq      mean    median        uq       max neval
    #  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
    #  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100
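The verbose = TRUE check mentioned above can be sketched on a small made-up table. The diagnostic text is not reproduced here because its wording differs between data.table versions. With grouping, it typically reports that lapply(.SD, sum) was rewritten into one sum call per column and passed to data.table's optimized grouped sum.

```r
library(data.table)

# Small hypothetical table
DT = data.table(A = rep(1:3, each = 5L), C = 1:15, D = 15:1)

# verbose = TRUE prints diagnostics about how j was evaluated;
# the result itself is the same as without it
res = DT[ , lapply(.SD, sum), by = A, verbose = TRUE]
res
```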

183

answered Sep 25 '22 04:09

MichaelChirico
