I understand it is useful to preallocate a vector or a matrix, because they are always stored in a contiguous memory block. However, in terms of a list, it can contains elements of difference length and modes. So my first guess is that a list might merely contains pointers to the true address of its elements. Am I correct? A related question here What is the internal implementation of lists? It says the list is essentially an array, but it does't cover how the elements are stored in a list while size of each element may change. Example 1: If a list contains <code>a,b,c,d,e</code>, and when <code>myList$a<-1:1000000</code>, is the list modified in place (which means only <code>a</code> is updated) or the whole list are copied and updated? Example 2 <pre class="prettyprint"><code>> system.time( { myList <- list() + myList$a <- 1:10000000 + myList$b <- 1:10000100 + myList$c <- 1:10000200 + myList$d <- 1:10000300}) user system elapsed 0.01 0.02 0.03 > system.time({ myList2<-list(a=1:10000000,b=1:10000100,c=1:10000200,d=1:10000300) }) user system elapsed 0.00 0.03 0.03 </code></pre> would <code>myList</code> be very inefficient compared to <code>myList2</code> due to no preallocation? Or no noticeable performance difference at all no matter how large <code>a,b,c,d</code> are because the first one copies only pointers a few times more? Here comes to preallocation. How does it look like for a list? Does it only preallocate the memory for the pointers? If that's the case, I don't see any use since there won't be much data copying for pointers anyway. <pre class="prettyprint"><code>> system.time( { myList <- vector("list", 4) + myList[[1]] <- 1:10000000 + myList[[2]] <- 1:10000100 + myList[[3]] <- 1:10000200 + myList[[4]] <- 1:10000300 + names(myList) <- letters[1:4]}) user system elapsed 0.01 0.02 0.03 </code></pre>

This is too long for a comment -- but not a complete answer. Named lists are treated differently to unnamed lists in terms of copying when modified. Copying (for large objects) This is a complex issue. See http://radfordneal.wordpress.com/2013/07/02/fixing-rs-named-problems-in-pqr/ for a reasonable explanation and note that some of these changes are being implemented in the development version of R. Also see http://r.789695.n4.nabble.com/Confused-about-NAMED-td4103326.html for more understanding of the underlying complexities Here are some timings for various possibilities of preallocation <pre class="prettyprint"><code># preallocated contents so timing is list related only .a <- seq_len(1e6); .b <- seq_len((1e6 + 1)) .c <- seq_len((1e6 + 2)); .d <- seq_len((1e6 + 3)) f1 <- function(){ # using `$<-` empty list x <- list() x$a <- .a x$b <- .b x$c <- .c x$d <- .d x } f2 <- function(){ # using `[[<-` on empty list x <- list() x[['a']] <- .a x[['b']] <- .b x[['c']] <- .c x[['d']] <- .d x } f3 <- function(){ # using `[[<-` on empty list, naming later x <- list() x[[1]] <- .a x[[2]] <- .b x[[3]] <- .c x[[4]] <- .d names(x) <- letters[1:4] x } f4 <- function(){ # just define the list x <- list(a = .a, b = .b, c = .c, d = .d) } f5 <- function(){ # create a list of length 4, then fill and name x <- vector(mode = 'list', length = 4) x[[1]] <- .a x[[2]] <- .b x[[3]] <- .c x[[4]] <- .d names(x) <- letters[1:4] x } # f6 <- function(){ # # this doesn't work! # # it creates a list of length 8 # x <- vector(mode = 'list', length = 4) # x[['a']] <- .a # x[['b']] <- .b # x[['c']] <- .c # x[['d']] <- .d # x # } f7 <-function(){ # pre allocate list, name then fill x <- vector(mode = 'list', length = 4) names(x) <- letters[1:4] x[[1]] <- .a x[[2]] <- .b x[[3]] <- .c x[[4]] <- .d x } f8 <- function(){ # preallocate correct length and then name # and fill by name x <- vector(mode = 'list', length = 4) names(x) <- letters[1:4] x[['a']] <- .a x[['b']] <- .b x[['c']] <- .c x[['d']] <- .d x } library(microbenchmark) microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100) microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100) # Unit: microseconds # expr min lq median uq max neval # f1() 6.038 11.169 12.980 14.791 34.110 100 # f2() 2528.881 4387.962 4707.014 6472.823 74586.266 100 # f3() 2532.805 4537.376 4714.258 5353.722 74903.206 100 # f4() 2475.756 4531.489 4721.503 6331.860 74460.395 100 # f5() 2508.959 4512.474 4759.535 6673.551 7966.668 100 # f7() 2545.181 4477.761 4709.127 6372.610 7964.856 100 # f8() 2508.053 4467.799 4669.131 6181.993 74236.726 100 # # All results are identical. all(identical(f1(),f2()),identical(f1(),f3()), identical(f1(),f4()),identical(f1(),f5()), identical(f1(),f7()),identical(f1(),f8())) </code></pre> # [1] TRUE All results are identical. Clearly using $<-` on an empty list is the clear winner. This goes against my thinking of what should be quickest

Why is preallocation useful for list?

Q: What are benefits to Preallocation Matlab?

If you preallocate a 1-by-1,000,000 block of memory for x and initialize it to zero, then the code runs much faster because there is no need to repeatedly reallocate memory for the growing data structure.

Q: Should I preallocate array Python?

You don't need to preallocate anything. Behind the scenes, the list type will periodically allocate more space than it needs for its immediate use to amortize the cost of resizing the underlying array across multiple updates.

Q: What does it mean to preallocate an array?

Pre-allocating Space Thus, if you know (or have a good guess) at how big the array needs to be in the first place, you can "pre-allocate" it, meaning pre-size the array.

Q: What is Preallocate memory?

"Pre-allocated memory" means that a program should allocate all the required memory blocks once after startup (using the new operator, as usual), rather than allocate memory multiple times during execution and leave memory which is no longer needed for the garbage collector to free.

Tags:

r

I understand it is useful to preallocate a vector or a matrix, because they are always stored in a contiguous memory block.

However, in terms of a list, it can contains elements of difference length and modes. So my first guess is that a list might merely contains pointers to the true address of its elements. Am I correct? A related question here What is the internal implementation of lists? It says the list is essentially an array, but it does't cover how the elements are stored in a list while size of each element may change.

Example 1: If a list contains a,b,c,d,e, and when myList$a<-1:1000000, is the list modified in place (which means only a is updated) or the whole list are copied and updated?

Example 2

> system.time( { myList <- list()
+                myList$a <- 1:10000000
+                myList$b <- 1:10000100
+                myList$c <- 1:10000200 
+                myList$d <- 1:10000300})
   user  system elapsed 
   0.01    0.02    0.03

> system.time({ myList2<-list(a=1:10000000,b=1:10000100,c=1:10000200,d=1:10000300) })
   user  system elapsed 
   0.00    0.03    0.03

would myList be very inefficient compared to myList2 due to no preallocation? Or no noticeable performance difference at all no matter how large a,b,c,d are because the first one copies only pointers a few times more?

Here comes to preallocation. How does it look like for a list? Does it only preallocate the memory for the pointers? If that's the case, I don't see any use since there won't be much data copying for pointers anyway.

> system.time( { myList <- vector("list", 4)
+                myList[[1]] <- 1:10000000
+                myList[[2]] <- 1:10000100
+                myList[[3]]  <- 1:10000200 
+                myList[[4]] <- 1:10000300
+                names(myList) <- letters[1:4]})
   user  system elapsed 
   0.01    0.02    0.03

528

asked Aug 25 '13 00:08

colinfang

1 Answers

This is too long for a comment -- but not a complete answer.

Named lists are treated differently to unnamed lists in terms of copying when modified.

Copying (for large objects)

This is a complex issue. See http://radfordneal.wordpress.com/2013/07/02/fixing-rs-named-problems-in-pqr/ for a reasonable explanation and note that some of these changes are being implemented in the development version of R. Also see http://r.789695.n4.nabble.com/Confused-about-NAMED-td4103326.html for more understanding of the underlying complexities

Here are some timings for various possibilities of preallocation

# preallocated contents so timing is list related only
.a <- seq_len(1e6); .b <- seq_len((1e6 + 1))
.c <- seq_len((1e6 + 2)); .d <- seq_len((1e6 + 3))



f1 <- function(){
# using `$<-` empty list
  x <- list()
  x$a <- .a
  x$b <- .b
  x$c <- .c 
  x$d <- .d
  x
}


f2 <- function(){
  # using `[[<-` on empty list
  x <- list()
  x[['a']] <- .a
  x[['b']] <- .b
  x[['c']] <- .c
  x[['d']] <- .d
  x
}


f3 <- function(){
  # using `[[<-` on empty list, naming later
  x <- list()
  x[[1]] <- .a
  x[[2]] <- .b
  x[[3]] <- .c
  x[[4]] <- .d
  names(x) <- letters[1:4]
  x
}

f4 <- function(){ 
  # just define the list
  x <- list(a = .a, b = .b,
            c = .c, d = .d)
}


f5 <- function(){ 
  # create a list of length 4, then fill and name
  x <- vector(mode = 'list', length = 4)
  x[[1]] <- .a
  x[[2]] <- .b
  x[[3]] <- .c
  x[[4]] <- .d
  names(x) <- letters[1:4]
  x
}

# f6 <- function(){
#  # this doesn't work!
#  # it creates a list of length 8
#  x <- vector(mode = 'list', length = 4)
#  x[['a']] <- .a
#  x[['b']] <- .b
#  x[['c']] <- .c
#  x[['d']] <- .d
#  x
# }


f7 <-function(){
# pre allocate list, name then fill
x <- vector(mode = 'list', length = 4)
  names(x) <- letters[1:4]
  x[[1]] <- .a
  x[[2]] <- .b
  x[[3]] <- .c
  x[[4]] <- .d
  x
}

f8 <- function(){
# preallocate correct length and then name
# and fill by name
  x <- vector(mode = 'list', length = 4)
  names(x) <- letters[1:4]
  x[['a']] <- .a
  x[['b']] <- .b
  x[['c']] <- .c
  x[['d']] <- .d
  x
}

library(microbenchmark)
microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100)


microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100)
# Unit: microseconds
#   expr      min       lq   median       uq       max neval
#   f1()    6.038   11.169   12.980   14.791    34.110   100
#   f2() 2528.881 4387.962 4707.014 6472.823 74586.266   100
#   f3() 2532.805 4537.376 4714.258 5353.722 74903.206   100
#   f4() 2475.756 4531.489 4721.503 6331.860 74460.395   100
#   f5() 2508.959 4512.474 4759.535 6673.551  7966.668   100
#   f7() 2545.181 4477.761 4709.127 6372.610  7964.856   100
#   f8() 2508.053 4467.799 4669.131 6181.993 74236.726   100
#
#  All results are identical.
all(identical(f1(),f2()),identical(f1(),f3()),
    identical(f1(),f4()),identical(f1(),f5()),
    identical(f1(),f7()),identical(f1(),f8()))

# [1] TRUE

All results are identical.

Clearly using $<-` on an empty list is the clear winner. This goes against my thinking of what should be quickest

answered Oct 31 '22 08:10

mnel

Related questions
                            
                                Identifying overlap zones in R raster package
                            
                                Is there an equivalent of R's dput() for Matlab?
                            
                                Constraining stat_smooth to a particular range
                            
                                Difference between tests/ and inst/tests/ for R packages
                            
                                How to place divisions between facet grid lines
                            
                                how do I replace numeric codes with value labels from a lookup table?
                            
                                Match with negation
                            
                                How to remove a lower order parameter in a model when the higher order parameters remain?
                            
                                How do I transform a vector and a list of lists into a data.frame in R?
                            
                                Parsing ISO8601 in R
                            
                                Duelling dendrograms in r (Placing dendrograms back to back in r)
                            
                                Preventing a second legend
                            
                                Usage of local variables
                            
                                have plot points coloured on a spectrum in R
                            
                                Applying a function to a multidimensional array with grouping variable
                            
                                when [:punct:] is too much [duplicate]
                            
                                Generate a random number from a density object (or more broadly from a set of numbers)
                            
                                Difference between .Rprofile and .First
                            
                                R - Evaluate a nested function in an environment
                            
                                How can I access Windows COM objects in R v3?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With