I was preallocating a big data.frame to fill in later, which I normally do with NAs like this:
n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
and I wondered whether it would make things any faster later if I specified the data types up front, so I tested:
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
> system.time(f1())
user system elapsed
0.219 0.042 0.260
> system.time(f2())
user system elapsed
1.018 0.052 1.072
So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious about why this is slower, and I wonder whether it is ever appropriate to initialize with data types rather than NAs.
--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
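To make the type mismatch concrete, here is a small sketch (the name `typed` is made up for illustration). A bare NA is logical and numeric(n) is double, so neither matches the integer vector 1:n assigned later. NA_integer_ and NA_character_ are base R's typed missing values, so they are one way to preallocate with the types that will actually be used. This only keeps the types consistent; it does not avoid the cost of filling the columns twice.

```r
n <- 10

typeof(1:n)         # "integer"
typeof(numeric(n))  # "double"
typeof(NA)          # "logical"

# Preallocate with typed NAs so the later assignments keep the column types.
typed <- data.frame(c1 = 1:n, c2 = NA_integer_, c3 = NA_character_,
                    stringsAsFactors = FALSE)
sapply(typed, typeof)

# Filling the columns does not change their types.
typed$c2 <- 1:n
typed$c3 <- sample(LETTERS, size = n, replace = TRUE)
sapply(typed, typeof)
```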
Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size= n, replace = TRUE)
}
print(system.time(f1()))
# user system elapsed
# 0.194 0.023 0.216
print(system.time(f2()))
# user system elapsed
# 0.336 0.037 0.374
print(system.time(f3()))
# user system elapsed
# 0.057 0.007 0.063
The reason for this is that when you preassign, a column of length n is created, e.g.:
str(data.frame(x=1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1
Note that the character column has been converted to a factor, which will be slower than setting stringsAsFactors = FALSE.
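As a quick check, assuming nothing beyond base R, passing stringsAsFactors = FALSE keeps the preallocated column as character. Since R 4.0.0 this is the default, so the factor conversion shown above only happens on older versions or when it is requested explicitly.

```r
# Keep the preallocated character column as character instead of factor.
d <- data.frame(x = 1:2, y = character(2), stringsAsFactors = FALSE)
str(d)
class(d$y)  # "character"
```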
@David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect. Profiling shows what is being called, which can give a clue as to why some calls are slower than others:
library(profr)
profr(f1())
## Read 9 items
## f level time start end leaf source
## 8 f1 1 0.16 0.00 0.16 FALSE <NA>
## 9 data.frame 2 0.04 0.00 0.04 TRUE base
## 10 $<- 2 0.02 0.04 0.06 FALSE base
## 11 sample 2 0.04 0.06 0.10 TRUE base
## 12 $<- 2 0.06 0.10 0.16 FALSE base
## 13 $<-.data.frame 3 0.12 0.04 0.16 TRUE base
profr(f2())
## Read 15 items
## f level time start end leaf source
## 8 f2 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.frame 2 0.12 0.00 0.12 TRUE base
## 10 : 2 0.02 0.12 0.14 TRUE base
## 11 $<- 2 0.02 0.18 0.20 FALSE base
## 12 sample 2 0.02 0.20 0.22 TRUE base
## 13 $<- 2 0.06 0.22 0.28 FALSE base
## 14 as.data.frame 3 0.08 0.04 0.12 FALSE base
## 15 $<-.data.frame 3 0.10 0.18 0.28 TRUE base
## 16 as.data.frame.character 4 0.08 0.04 0.12 FALSE base
## 17 factor 5 0.08 0.04 0.12 FALSE base
## 18 unique 6 0.06 0.04 0.10 FALSE base
## 19 match 6 0.02 0.10 0.12 TRUE base
## 20 unique.default 7 0.06 0.04 0.10 TRUE base
profr(f3())
## Read 4 items
## f level time start end leaf source
## 8 f3 1 0.06 0.00 0.06 FALSE <NA>
## 9 $<- 2 0.02 0.00 0.02 FALSE base
## 10 sample 2 0.04 0.02 0.06 TRUE base
## 11 $<-.data.frame 3 0.02 0.00 0.02 TRUE base
Clearly f2() is slower than f1() because of the character-to-factor conversion, which includes recreating the levels, etc.
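If profr is not available, a similar picture can be had from base R's sampling profiler. The following is a sketch, not the method used above: it re-declares f2 with stringsAsFactors = TRUE (an assumption, so the factor conversion is reproduced on any R version) and uses a shorter sampling interval than the default so that a short run still collects samples.

```r
n <- 1e6

# f2 from the question, with the factor conversion requested explicitly.
f2 <- function() {
  b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n),
                  stringsAsFactors = TRUE)
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
  b
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)  # start sampling the call stack
res <- f2()
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.total)                 # time per function, including callees
unlink(prof_file)
```

The by.total table should list factor, unique and match under data.frame, matching what profr reports.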
For efficient use of memory I would suggest the data.table package, which avoids (as much as possible) the internal copying of objects:
library(data.table)
f4 <- function(){
f <- data.table(c1 = 1:n)
f[,c2:=1L:n]
f[,c3:=sample(LETTERS, size= n, replace = TRUE)]
}
system.time(f1())
## user system elapsed
## 0.15 0.02 0.18
system.time(f2())
## user system elapsed
## 0.19 0.00 0.19
system.time(f3())
## user system elapsed
## 0.09 0.00 0.09
system.time(f4())
## user system elapsed
## 0.04 0.00 0.04
Note that with data.table you can add two columns at once (and by reference):
# Thanks to @Thell for pointing this out.
# f is the data.table created inside f4(); `with = F` is not needed with `:=`.
f[, `:=`(c2 = 1L:n, c3 = sample(LETTERS, n, TRUE))]
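One way to see the "by reference" part is to compare the object's address before and after adding the columns. This is a rough sketch, assuming the data.table package is installed; the names `dt` and `addr_before` are only for illustration.

```r
library(data.table)

n <- 1e5
dt <- data.table(c1 = 1:n)
addr_before <- address(dt)

# Add two columns in a single call; dt is modified in place.
dt[, `:=`(c2 = 1L:n, c3 = sample(LETTERS, n, TRUE))]

# Compare the address before and after the update.
identical(addr_before, address(dt))
names(dt)
```

The address is expected to stay the same here because data.table over-allocates room for new columns, so no copy of the table is needed for a small number of additions.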
The same comparison with n = 1e7, with f1 to f3 now returning the object:
n <- 1e7
f1 <- function() {
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
a$c2 <- 1:n
a$c3 <- sample(LETTERS, size = n, replace = TRUE)
a
}
f2 <- function() {
b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
b$c2 <- 1:n
b$c3 <- sample(LETTERS, size = n, replace = TRUE)
b
}
f3 <- function() {
c <- data.frame(c1 = 1:n)
c$c2 <- 1:n
c$c3 <- sample(LETTERS, size = n, replace = TRUE)
c
}
f4 <- function() {
f <- data.table(c1 = 1:n)
f[, `:=`(c2, 1L:n)]
f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]
}
system.time(f1())
## user system elapsed
## 1.62 0.34 2.13
system.time(f2())
## user system elapsed
## 2.14 0.66 2.79
system.time(f3())
## user system elapsed
## 0.78 0.25 1.03
system.time(f4())
## user system elapsed
## 0.37 0.08 0.46
profr(f1())
## Read 105 items
## f level time start end leaf source
## 8 f1 1 2.08 0.00 2.08 FALSE <NA>
## 9 data.frame 2 0.66 0.00 0.66 FALSE base
## 10 : 2 0.02 0.66 0.68 TRUE base
## 11 $<- 2 0.32 0.84 1.16 FALSE base
## 12 sample 2 0.40 1.16 1.56 TRUE base
## 13 $<- 2 0.32 1.76 2.08 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.12 0.54 0.66 TRUE base
## 17 $<-.data.frame 3 1.24 0.84 2.08 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f2())
## Read 145 items
## f level time start end leaf source
## 8 f2 1 2.88 0.00 2.88 FALSE <NA>
## 9 data.frame 2 1.40 0.00 1.40 FALSE base
## 10 : 2 0.04 1.40 1.44 TRUE base
## 11 $<- 2 0.36 1.64 2.00 FALSE base
## 12 sample 2 0.40 2.00 2.40 TRUE base
## 13 $<- 2 0.36 2.52 2.88 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 numeric 3 0.06 0.02 0.08 TRUE base
## 16 character 3 0.04 0.08 0.12 TRUE base
## 17 as.data.frame 3 1.06 0.12 1.18 FALSE base
## 18 unlist 3 0.20 1.20 1.40 TRUE base
## 19 $<-.data.frame 3 1.24 1.64 2.88 TRUE base
## 20 as.data.frame.integer 4 0.04 0.12 0.16 TRUE base
## 21 as.data.frame.numeric 4 0.16 0.18 0.34 TRUE base
## 22 as.data.frame.character 4 0.78 0.40 1.18 FALSE base
## 23 factor 5 0.74 0.40 1.14 FALSE base
## 24 as.data.frame.vector 5 0.04 1.14 1.18 TRUE base
## 25 unique 6 0.38 0.40 0.78 FALSE base
## 26 match 6 0.32 0.78 1.10 TRUE base
## 27 unique.default 7 0.38 0.40 0.78 TRUE base
profr(f3())
## Read 37 items
## f level time start end leaf source
## 8 f3 1 0.72 0.00 0.72 FALSE <NA>
## 9 data.frame 2 0.10 0.00 0.10 FALSE base
## 10 : 2 0.02 0.10 0.12 TRUE base
## 11 $<- 2 0.08 0.14 0.22 FALSE base
## 12 sample 2 0.26 0.22 0.48 TRUE base
## 13 $<- 2 0.16 0.56 0.72 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.02 0.08 0.10 TRUE base
## 17 $<-.data.frame 3 0.58 0.14 0.72 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f4())
## Read 15 items
## f level time start end leaf source
## 8 f4 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.table 2 0.02 0.00 0.02 FALSE data.table
## 10 [ 2 0.26 0.02 0.28 FALSE base
## 11 : 3 0.02 0.00 0.02 TRUE base
## 12 [.data.table 3 0.26 0.02 0.28 FALSE <NA>
## 13 eval 4 0.26 0.02 0.28 FALSE base
## 14 eval 5 0.26 0.02 0.28 FALSE base
## 15 : 6 0.02 0.02 0.04 TRUE base
## 16 sample 6 0.24 0.04 0.28 TRUE base