I understand it is useful to preallocate a vector or a matrix, because they are always stored in a contiguous memory block.
However, in terms of a list, it can contains elements of difference length and modes. So my first guess is that a list might merely contains pointers to the true address of its elements. Am I correct? A related question here What is the internal implementation of lists? It says the list is essentially an array, but it does't cover how the elements are stored in a list while size of each element may change.
Example 1: If a list contains a,b,c,d,e
, and when myList$a<-1:1000000
, is the list modified in place (which means only a
is updated) or the whole list are copied and updated?
Example 2
> system.time( { myList <- list()
+ myList$a <- 1:10000000
+ myList$b <- 1:10000100
+ myList$c <- 1:10000200
+ myList$d <- 1:10000300})
user system elapsed
0.01 0.02 0.03
> system.time({ myList2<-list(a=1:10000000,b=1:10000100,c=1:10000200,d=1:10000300) })
user system elapsed
0.00 0.03 0.03
would myList
be very inefficient compared to myList2
due to no preallocation? Or no noticeable performance difference at all no matter how large a,b,c,d
are because the first one copies only pointers a few times more?
Here comes to preallocation. How does it look like for a list? Does it only preallocate the memory for the pointers? If that's the case, I don't see any use since there won't be much data copying for pointers anyway.
> system.time( { myList <- vector("list", 4)
+ myList[[1]] <- 1:10000000
+ myList[[2]] <- 1:10000100
+ myList[[3]] <- 1:10000200
+ myList[[4]] <- 1:10000300
+ names(myList) <- letters[1:4]})
user system elapsed
0.01 0.02 0.03
If you preallocate a 1-by-1,000,000 block of memory for x and initialize it to zero, then the code runs much faster because there is no need to repeatedly reallocate memory for the growing data structure.
You don't need to preallocate anything. Behind the scenes, the list type will periodically allocate more space than it needs for its immediate use to amortize the cost of resizing the underlying array across multiple updates.
Pre-allocating Space Thus, if you know (or have a good guess) at how big the array needs to be in the first place, you can "pre-allocate" it, meaning pre-size the array.
"Pre-allocated memory" means that a program should allocate all the required memory blocks once after startup (using the new operator, as usual), rather than allocate memory multiple times during execution and leave memory which is no longer needed for the garbage collector to free.
This is too long for a comment -- but not a complete answer.
Named lists are treated differently to unnamed lists in terms of copying when modified.
Copying (for large objects)
This is a complex issue. See http://radfordneal.wordpress.com/2013/07/02/fixing-rs-named-problems-in-pqr/ for a reasonable explanation and note that some of these changes are being implemented in the development version of R. Also see http://r.789695.n4.nabble.com/Confused-about-NAMED-td4103326.html for more understanding of the underlying complexities
Here are some timings for various possibilities of preallocation
# preallocated contents so timing is list related only
.a <- seq_len(1e6); .b <- seq_len((1e6 + 1))
.c <- seq_len((1e6 + 2)); .d <- seq_len((1e6 + 3))
f1 <- function(){
# using `$<-` empty list
x <- list()
x$a <- .a
x$b <- .b
x$c <- .c
x$d <- .d
x
}
f2 <- function(){
# using `[[<-` on empty list
x <- list()
x[['a']] <- .a
x[['b']] <- .b
x[['c']] <- .c
x[['d']] <- .d
x
}
f3 <- function(){
# using `[[<-` on empty list, naming later
x <- list()
x[[1]] <- .a
x[[2]] <- .b
x[[3]] <- .c
x[[4]] <- .d
names(x) <- letters[1:4]
x
}
f4 <- function(){
# just define the list
x <- list(a = .a, b = .b,
c = .c, d = .d)
}
f5 <- function(){
# create a list of length 4, then fill and name
x <- vector(mode = 'list', length = 4)
x[[1]] <- .a
x[[2]] <- .b
x[[3]] <- .c
x[[4]] <- .d
names(x) <- letters[1:4]
x
}
# f6 <- function(){
# # this doesn't work!
# # it creates a list of length 8
# x <- vector(mode = 'list', length = 4)
# x[['a']] <- .a
# x[['b']] <- .b
# x[['c']] <- .c
# x[['d']] <- .d
# x
# }
f7 <-function(){
# pre allocate list, name then fill
x <- vector(mode = 'list', length = 4)
names(x) <- letters[1:4]
x[[1]] <- .a
x[[2]] <- .b
x[[3]] <- .c
x[[4]] <- .d
x
}
f8 <- function(){
# preallocate correct length and then name
# and fill by name
x <- vector(mode = 'list', length = 4)
names(x) <- letters[1:4]
x[['a']] <- .a
x[['b']] <- .b
x[['c']] <- .c
x[['d']] <- .d
x
}
library(microbenchmark)
microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100)
microbenchmark(f1(),f2(),f3(),f4(),f5(),f7(),f8(),times=100)
# Unit: microseconds
# expr min lq median uq max neval
# f1() 6.038 11.169 12.980 14.791 34.110 100
# f2() 2528.881 4387.962 4707.014 6472.823 74586.266 100
# f3() 2532.805 4537.376 4714.258 5353.722 74903.206 100
# f4() 2475.756 4531.489 4721.503 6331.860 74460.395 100
# f5() 2508.959 4512.474 4759.535 6673.551 7966.668 100
# f7() 2545.181 4477.761 4709.127 6372.610 7964.856 100
# f8() 2508.053 4467.799 4669.131 6181.993 74236.726 100
#
# All results are identical.
all(identical(f1(),f2()),identical(f1(),f3()),
identical(f1(),f4()),identical(f1(),f5()),
identical(f1(),f7()),identical(f1(),f8()))
# [1] TRUE
All results are identical.
Clearly using $<-` on an empty list is the clear winner. This goes against my thinking of what should be quickest
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With