
Why is it slower to prespecify type in a data.frame?

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:

n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)

and I wondered if it would make things any faster later if I specified data types up front, so I tested

f1 <- function() {
    a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
    a$c2 <- 1:n
    a$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

> system.time(f1())
   user  system elapsed 
  0.219   0.042   0.260 
> system.time(f2())
   user  system elapsed 
  1.018   0.052   1.072 

So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious why this is slower, and wonder whether it is ever appropriate to initialize with data types rather than NAs.

--

Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
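The type mismatch noted in this edit can be checked directly. The following is a minimal sketch with a much smaller n and illustrative variable names (b, b2, type_before, type_after are not from the original post). It shows that 1:n is integer while numeric(n) allocates doubles, so assigning 1:n replaces the preallocated column with one of a different type.

```r
n <- 1000L  # smaller than in the question, for illustration only

typeof(1:n)         # integer storage
typeof(numeric(n))  # double storage

# Preallocate c2 as double, then overwrite it with an integer vector
b <- data.frame(c1 = 1:n, c2 = numeric(n))
type_before <- typeof(b$c2)
b$c2 <- 1:n
type_after <- typeof(b$c2)

# Preallocating with integer(n) matches the data that is assigned later
b2 <- data.frame(c1 = 1:n, c2 = integer(n))
```

Here type_before and type_after differ, whereas b2$c2 already has the type of the data that will be assigned to it.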

Gregor Thomas asked Sep 03 '12 23:09



2 Answers

Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:

f3 <- function() {
    c <- data.frame(c1 = 1:n)
    c$c2 <- 1:n
    c$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

print(system.time(f1()))
#   user  system elapsed 
#  0.194   0.023   0.216 
print(system.time(f2()))
#   user  system elapsed 
#  0.336   0.037   0.374 
print(system.time(f3()))
#   user  system elapsed 
#  0.057   0.007   0.063 
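If every column is available before the data frame is needed, one further option (not timed in this answer, so treat it as an untested suggestion rather than a benchmark result) is to build the vectors first and create the data frame in a single call. A minimal sketch, with a small n and the name d chosen only for illustration:

```r
n <- 1000L  # small n for illustration

# Build all columns first, then create the data frame in one call.
# stringsAsFactors = FALSE keeps c3 as character on R versions before 4.0.0.
c1 <- 1:n
c2 <- 1:n
c3 <- sample(LETTERS, size = n, replace = TRUE)
d <- data.frame(c1 = c1, c2 = c2, c3 = c3, stringsAsFactors = FALSE)

str(d)
```

This avoids any placeholder columns, so nothing is created only to be replaced.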

The reason is that when you preassign, a column of length n is created and then discarded when you overwrite it. For example:

str(data.frame(x=1:2, y = character(2)))
## 'data.frame':    2 obs. of  2 variables:
## $ x: int  1 2
## $ y: Factor w/ 1 level "": 1 1

Note that the character column has been converted to a factor (the default behaviour before R 4.0.0), which is slower than setting stringsAsFactors = FALSE.
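The effect of that argument can be seen on the same two-row example. This is a small sketch; the names d_chr and d_fac are illustrative, and stringsAsFactors is set explicitly in both calls so the result does not depend on the R version's default.

```r
# With stringsAsFactors = FALSE the preallocated column stays character
d_chr <- data.frame(x = 1:2, y = character(2), stringsAsFactors = FALSE)
str(d_chr)

# Forcing the old default reproduces the conversion to factor
d_fac <- data.frame(x = 1:2, y = character(2), stringsAsFactors = TRUE)
str(d_fac)
```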

David Robinson answered Oct 21 '22 02:10



@David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect.

Profiling shows which functions are being called, which can give a clue as to why some calls are slower than others:

library(profr)
profr(f1())
## Read 9 items
##                 f level time start  end  leaf source
## 8              f1     1 0.16  0.00 0.16 FALSE   <NA>
## 9      data.frame     2 0.04  0.00 0.04  TRUE   base
## 10            $<-     2 0.02  0.04 0.06 FALSE   base
## 11         sample     2 0.04  0.06 0.10  TRUE   base
## 12            $<-     2 0.06  0.10 0.16 FALSE   base
## 13 $<-.data.frame     3 0.12  0.04 0.16  TRUE   base
profr(f2())
## Read 15 items
##                          f level time start  end  leaf source
## 8                       f2     1 0.28  0.00 0.28 FALSE   <NA>
## 9               data.frame     2 0.12  0.00 0.12  TRUE   base
## 10                       :     2 0.02  0.12 0.14  TRUE   base
## 11                     $<-     2 0.02  0.18 0.20 FALSE   base
## 12                  sample     2 0.02  0.20 0.22  TRUE   base
## 13                     $<-     2 0.06  0.22 0.28 FALSE   base
## 14           as.data.frame     3 0.08  0.04 0.12 FALSE   base
## 15          $<-.data.frame     3 0.10  0.18 0.28  TRUE   base
## 16 as.data.frame.character     4 0.08  0.04 0.12 FALSE   base
## 17                  factor     5 0.08  0.04 0.12 FALSE   base
## 18                  unique     6 0.06  0.04 0.10 FALSE   base
## 19                   match     6 0.02  0.10 0.12  TRUE   base
## 20          unique.default     7 0.06  0.04 0.10  TRUE   base
profr(f3())
## Read 4 items
##                f level time start  end  leaf source
## 8              f3     1 0.06  0.00 0.06 FALSE   <NA>
## 9             $<-     2 0.02  0.00 0.02 FALSE   base
## 10         sample     2 0.04  0.02 0.06  TRUE   base
## 11 $<-.data.frame     3 0.02  0.00 0.02  TRUE   base

Clearly f2() is slower than f1() because of the character-to-factor conversion: factor() calls unique() and match() to build the levels.
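If profr is not installed, base R's sampling profiler (Rprof and summaryRprof from the utils package) gives a similar breakdown. The sketch below is an alternative to the profr calls above, not what was used to produce the output shown. It repeats the f2 definition so that it runs on its own; the repetition count and sampling interval are arbitrary choices, and the functions reported will vary with the R version (for instance, no factor() calls are expected under the R 4.0.0 default).

```r
n <- 1e6

f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size = n, replace = TRUE)
    b
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start sampling every 10 ms
for (i in 1:20) f2()               # repeat so enough samples are collected
Rprof(NULL)                        # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.total)                # time per function, including callees
```

The by.total table plays roughly the same role as the time column in the profr output: functions near the top account for most of the elapsed time.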

For efficient use of memory I would suggest the data.table package, which avoids (as much as possible) the internal copying of objects.

library(data.table)
f4 <- function(){
  f <- data.table(c1 = 1:n)
  f[,c2:=1L:n]
  f[,c3:=sample(LETTERS, size= n, replace = TRUE)]
}


system.time(f1())
##  user  system elapsed 
##  0.15    0.02    0.18 
system.time(f2())
## user  system elapsed 
## 0.19    0.00    0.19 
system.time(f3())
## user  system elapsed 
## 0.09    0.00    0.09 
system.time(f4())
## user  system elapsed 
## 0.04    0.00    0.04 

Note that with data.table you can add two columns at once (and by reference):

  # Thanks to @Thell for pointing this out.
f[, c("c2", "c3") := list(1L:n, sample(LETTERS, n, TRUE))]

EDIT -- functions that will return the required object (Well picked up @Dwin)

n <- 1e7
f1 <- function() {
    a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
    a$c2 <- 1:n
    a$c3 <- sample(LETTERS, size = n, replace = TRUE)
    a
}

f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size = n, replace = TRUE)
    b
}

f3 <- function() {
    c <- data.frame(c1 = 1:n)
    c$c2 <- 1:n
    c$c3 <- sample(LETTERS, size = n, replace = TRUE)
    c
}
f4 <- function() {
    f <- data.table(c1 = 1:n)
    f[, `:=`(c2, 1L:n)]
    f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]

}

system.time(f1())

##    user  system elapsed 
##    1.62    0.34    2.13 

system.time(f2())

##    user  system elapsed 
##    2.14    0.66    2.79 

system.time(f3())

##    user  system elapsed 
##    0.78    0.25    1.03 

system.time(f4())

##    user  system elapsed 
##    0.37    0.08    0.46 


profr(f1())
## Read 105 items
##                        f level time start  end  leaf source
## 8                     f1     1 2.08  0.00 2.08 FALSE   <NA>
## 9             data.frame     2 0.66  0.00 0.66 FALSE   base
## 10                     :     2 0.02  0.66 0.68  TRUE   base
## 11                   $<-     2 0.32  0.84 1.16 FALSE   base
## 12                sample     2 0.40  1.16 1.56  TRUE   base
## 13                   $<-     2 0.32  1.76 2.08 FALSE   base
## 14                     :     3 0.02  0.00 0.02  TRUE   base
## 15         as.data.frame     3 0.04  0.02 0.06 FALSE   base
## 16                unlist     3 0.12  0.54 0.66  TRUE   base
## 17        $<-.data.frame     3 1.24  0.84 2.08  TRUE   base
## 18 as.data.frame.integer     4 0.04  0.02 0.06  TRUE   base
profr(f2())
## Read 145 items
##                          f level time start  end  leaf source
## 8                       f2     1 2.88  0.00 2.88 FALSE   <NA>
## 9               data.frame     2 1.40  0.00 1.40 FALSE   base
## 10                       :     2 0.04  1.40 1.44  TRUE   base
## 11                     $<-     2 0.36  1.64 2.00 FALSE   base
## 12                  sample     2 0.40  2.00 2.40  TRUE   base
## 13                     $<-     2 0.36  2.52 2.88 FALSE   base
## 14                       :     3 0.02  0.00 0.02  TRUE   base
## 15                 numeric     3 0.06  0.02 0.08  TRUE   base
## 16               character     3 0.04  0.08 0.12  TRUE   base
## 17           as.data.frame     3 1.06  0.12 1.18 FALSE   base
## 18                  unlist     3 0.20  1.20 1.40  TRUE   base
## 19          $<-.data.frame     3 1.24  1.64 2.88  TRUE   base
## 20   as.data.frame.integer     4 0.04  0.12 0.16  TRUE   base
## 21   as.data.frame.numeric     4 0.16  0.18 0.34  TRUE   base
## 22 as.data.frame.character     4 0.78  0.40 1.18 FALSE   base
## 23                  factor     5 0.74  0.40 1.14 FALSE   base
## 24    as.data.frame.vector     5 0.04  1.14 1.18  TRUE   base
## 25                  unique     6 0.38  0.40 0.78 FALSE   base
## 26                   match     6 0.32  0.78 1.10  TRUE   base
## 27          unique.default     7 0.38  0.40 0.78  TRUE   base
profr(f3())
## Read 37 items
##                        f level time start  end  leaf source
## 8                     f3     1 0.72  0.00 0.72 FALSE   <NA>
## 9             data.frame     2 0.10  0.00 0.10 FALSE   base
## 10                     :     2 0.02  0.10 0.12  TRUE   base
## 11                   $<-     2 0.08  0.14 0.22 FALSE   base
## 12                sample     2 0.26  0.22 0.48  TRUE   base
## 13                   $<-     2 0.16  0.56 0.72 FALSE   base
## 14                     :     3 0.02  0.00 0.02  TRUE   base
## 15         as.data.frame     3 0.04  0.02 0.06 FALSE   base
## 16                unlist     3 0.02  0.08 0.10  TRUE   base
## 17        $<-.data.frame     3 0.58  0.14 0.72  TRUE   base
## 18 as.data.frame.integer     4 0.04  0.02 0.06  TRUE   base
profr(f4())
## Read 15 items
##               f level time start  end  leaf     source
## 8            f4     1 0.28  0.00 0.28 FALSE       <NA>
## 9    data.table     2 0.02  0.00 0.02 FALSE data.table
## 10            [     2 0.26  0.02 0.28 FALSE       base
## 11            :     3 0.02  0.00 0.02  TRUE       base
## 12 [.data.table     3 0.26  0.02 0.28 FALSE       <NA>
## 13         eval     4 0.26  0.02 0.28 FALSE       base
## 14         eval     5 0.26  0.02 0.28 FALSE       base
## 15            :     6 0.02  0.02 0.04  TRUE       base
## 16       sample     6 0.24  0.04 0.28  TRUE       base
mnel answered Oct 21 '22 03:10
