When should I use setDT() instead of data.table() to create a data.table?

Tags: r, data.table

I am having difficulty grasping the essence of the setDT() function. As I read code on SO, I frequently come across the use of setDT() to create a data.table. Of course the use of data.table() is ubiquitous. I feel like I solidly comprehend the nature of data.table() yet the relevance of setDT() eludes me. ?setDT tells me this:

setDT converts lists (both named and unnamed) and data.frames to data.tables by reference.

as well as:

In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column.

So this makes me think I should only use setDT() to make a data.table, right? Is setDT() simply a list to data.table converter?

library(data.table)

a <- letters[c(19,20,1,3,11,15,22,5,18,6,12,15,23)]
b <- seq(1,41,pi)
ab <- data.frame(a,b)
d <- data.table(ab)
e <- setDT(ab)

str(d)
#Classes ‘data.table’ and 'data.frame': 13 obs. of  2 variables:
# $ a: Factor w/ 12 levels "a","c","e","f",..: 9 10 1 2 5 7 11 3 8 4 ...
# $ b: num  1 4.14 7.28 10.42 13.57 ...
# - attr(*, ".internal.selfref")=<externalptr>

str(e)
#Classes ‘data.table’ and 'data.frame': 13 obs. of  2 variables:
# $ a: Factor w/ 12 levels "a","c","e","f",..: 9 10 1 2 5 7 11 3 8 4 ...
# $ b: num  1 4.14 7.28 10.42 13.57 ...
# - attr(*, ".internal.selfref")=<externalptr>

Seemingly no difference in this instance. In another instance the difference is evident:

ba <- list(a,b)
f <- data.table(ba)
g <- setDT(ba)

str(f)
#Classes ‘data.table’ and 'data.frame': 2 obs. of  1 variable:
# $ ba:List of 2
#  ..$ : chr  "s" "t" "a" "c" ...
#  ..$ : num  1 4.14 7.28 10.42 13.57 ...
# - attr(*, ".internal.selfref")=<externalptr>

str(g)
#Classes ‘data.table’ and 'data.frame': 13 obs. of  2 variables:
# $ V1: chr  "s" "t" "a" "c" ...
# $ V2: num  1 4.14 7.28 10.42 13.57 ...
# - attr(*, ".internal.selfref")=<externalptr>

When should I use setDT()? What makes setDT() relevant? Why not just make the original data.table() function capable of doing what setDT() is able to do?


asked Jan 29 '17 05:01

Dodge

2 Answers

Update:

@Roland makes some good points in the comments section, and the post is better for them. While I originally focused on memory overflow issues, he pointed out that even if this doesn't happen, memory management of various copies takes substantial time, which is a more common everyday concern. Examples of both issues have now been added as well.

I like this question on Stack Overflow because I think it is really about avoiding memory overflow in R when dealing with larger data sets. 😊 Those who are unfamiliar with the data.table family of set operations may benefit from this discussion!

One should use setDT() when working with larger data sets that take up a considerable amount of RAM, because the operation modifies the object in place, conserving memory. For data that occupies only a very small percentage of RAM, the copy made by data.table() is fine.

The creation of the setDT function was actually inspired by the following thread on Stack Overflow, which is about working with a large data set (several GB). You will see Matt Dowle chime in and suggest the 'setDT' name.

Convert a data frame to a data.table without copy

A bit more depth:

With R, data is stored in memory. This speeds things up considerably because RAM is much faster to access than storage devices. However, a problem can arise when one's data set takes up a large portion of RAM. Why? Because base R tends to copy a data.frame when certain operations are applied to it. This has improved since version 3.1, but addressing that is beyond the scope of this post. If you pull multiple data.frames or lists into one data.frame or data.table, your memory usage will expand rather quickly because at some point during the operation, multiple copies of your data exist in RAM. If the data set is big enough, R may run out of memory while the copies are being produced, and the allocation fails. See the example below: we get an error, and the original memory address and class of the object do not change.

> N <- 1e8
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> 
> pryr::object_size(data)
800 MB
> 
> tracemem(data)
[1] "<0000000006D2DF18>"
> 
> data <- data.table(data)
Error: cannot allocate vector of size 762.9 Mb
> 
> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
>

The ability to just modify the object in place without copying is a big deal. That is what setDT does when it takes a list or data.frame and returns a data.table. The same example as above, now using setDT, works without error. Both the class and the reported memory address change, but the data itself is not copied.

> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
> 
> setDT(data)
>  
> tracemem(data)
[1] "<0000000006A8C758>"
> class(data)
[1] "data.table" "data.frame"
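A small-scale sketch of the same idea, using data.table's address() helper to compare where a column lives before and after the conversion. The data and variable names are made up for illustration, and the final check reflects my understanding that setDT leaves the column vectors themselves untouched.

```r
library(data.table)

# Small illustrative data.frame (made-up data)
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

# Address of the column vector before conversion
addr_before <- address(df$x)

# Convert in place; no assignment is needed
setDT(df)

# Address of the same column after conversion
addr_after <- address(df$x)

class(df)                           # now includes "data.table"
identical(addr_before, addr_after)  # checks whether the column vector was copied
```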

@Roland points out that for most people, the bigger concern is speed, which suffers as a side effect of all this memory management. Here is an example with smaller data that does not exhaust memory and illustrates how much faster setDT is for this job. Notice the tracemem output after data <- data.table(data), which reports copies of data being made. Contrast that with setDT(data), which does not report a single copy. We then have to call tracemem(data) to see the new memory address.

> N <- 1e5
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> pryr::object_size(data)
808 kB

> # data.table method
> tracemem(data)
[1] "<0000000019098438>"
> data <- data.table(data)
tracemem[0x0000000019098438 -> 0x0000000007aad7d8]: data.table 
tracemem[0x0000000007aad7d8 -> 0x0000000007c518b8]: copy as.data.table.data.frame as.data.table data.table 
tracemem[0x0000000007aad7d8 -> 0x0000000018e454c8]: as.list.data.frame as.list vapply copy as.data.table.data.frame as.data.table data.table 
> class(data)
[1] "data.table" "data.frame"
> 
> # setDT method
> # back to data.frame
> data <- as.data.frame(data)
> class(data)
[1] "data.frame"
> tracemem(data)
[1] "<00000000125BE1A0>"
> setDT(data)
> tracemem(data)
[1] "<00000000125C2840>"
> class(data)
[1] "data.table" "data.frame"
>

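How does this impact timing? As the benchmark shows, setDT is much faster here: its median time is about 74 microseconds, versus about 64,000 microseconds for data.table().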

> # timing example
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> microbenchmark(setDT(data), data <- data.table(data))
Unit: microseconds
                     expr       min         lq        mean    median        uq        max neval
              setDT(data)    49.948    55.7635    69.66017    73.553    79.198    100.238   100
 data <- data.table(data) 54594.289 61238.8830 81545.64432 64179.131 68647.917 611632.427   100
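One caveat when repeating a timing like this: once setDT(data) has run, data is already a data.table, so later runs work on an already-converted object. A rough sketch that times a single conversion of each kind on a fresh data.frame, using base R's system.time() instead of microbenchmark. The helper name make_df and the sizes are assumptions chosen for illustration, and the timings will vary by machine.

```r
library(data.table)

N <- 1e5
P <- 20

# Helper (hypothetical name) that builds a fresh data.frame for each run
make_df <- function() as.data.frame(matrix(rnorm(N * P), nrow = N))

# Time one copy-based conversion
df_copy <- make_df()
time_copy <- system.time(dt_copy <- data.table(df_copy))

# Time one by-reference conversion
df_ref <- make_df()
time_ref <- system.time(setDT(df_ref))

# Elapsed seconds for each approach; exact values depend on the machine
time_copy["elapsed"]
time_ref["elapsed"]
```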

Set functions can be used in many areas, not just when converting objects to data.tables. You can find more information on reference semantics and how to apply them elsewhere by opening the vignette on the subject.

library(data.table)    
vignette("datatable-reference-semantics")
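As a brief taste of what that vignette covers, here is a minimal sketch of a few other by-reference operations (setnames, :=, setcolorder, and set). The table and column names are invented for the example.

```r
library(data.table)

dt <- data.table(id = 1:3, score = c(90, 85, 70))

# Rename a column in place
setnames(dt, "score", "points")

# Add a column by reference with :=
dt[, passed := points >= 80]

# Reorder columns in place
setcolorder(dt, c("passed", "id", "points"))

# Update a single cell by reference with set()
set(dt, i = 3L, j = "points", value = 75)

names(dt)
```

None of these calls needs an assignment back to dt, because each one modifies the existing object.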

This is a great question, and those thinking of using R with larger data sets, or who just want to speed up data manipulation activities, can benefit from being familiar with the significant performance improvements of data.table reference semantics.


answered Oct 17 '22 10:10

Justin

setDT() is not a replacement for data.table(). It's a more efficient replacement for as.data.table() that can be used with certain types of objects (lists and data.frames).

mydata <- as.data.table(mydata) will copy the object behind mydata, convert the copy to a data.table, then change the mydata symbol to point to the copy.
setDT(mydata) will change the object behind mydata to a data.table. No copying is done.
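The two behaviours can be seen side by side in a small sketch. The data is made up, and the variable names other than mydata are only for illustration.

```r
library(data.table)

mydata <- data.frame(id = 1:3, value = c(10, 20, 30))

# as.data.table() returns a new object; the original is left as it was
converted <- as.data.table(mydata)
class_after_copy <- class(mydata)

# setDT() changes the object behind `mydata` itself
setDT(mydata)
class_after_set <- class(mydata)

class_after_copy  # "data.frame"
class_after_set   # "data.table" "data.frame"
```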

So what's a realistic situation to use setDT()? When you can't control the class of the original data. For instance, most packages for working with databases give data.frame output. In that case, your code would be something like

mydata <- dbGetQuery(conn, "SELECT * FROM mytable")  # Returns a data.frame
setDT(mydata)                                        # Make it a data.table
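For a version that runs without a database connection, here is a sketch in which read.csv() stands in for any function you do not control that returns a data.frame. The data and column names are hypothetical.

```r
library(data.table)

# Stand-in for a function you do not control that returns a data.frame
# (hypothetical data; a database query would play this role in practice)
mydata <- read.csv(text = "region,amount
north,10
south,20
north,5")

# Convert the result in place
setDT(mydata)

# data.table syntax is now available on the same object
totals <- mydata[, .(total = sum(amount)), by = region]
totals
```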

When should you use as.data.table(x)? Whenever x isn't a list or data.frame. The most common use is for matrices.
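A minimal sketch of the matrix case, with a made-up matrix. The tryCatch() wrapper is only there so the example keeps running after setDT() refuses the input.

```r
library(data.table)

m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# as.data.table() handles a matrix, producing one column per matrix column
dt <- as.data.table(m)

# setDT() only accepts lists, data.frames and data.tables, so this errors
result <- tryCatch(setDT(m), error = function(e) "setDT() rejected the matrix")

names(dt)
result
```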

answered Oct 17 '22 09:10

Nathan Werth
