Mystery: Why does the as.character() function in a data.table run faster if I add and subtract another variable?

Tags:

r

data.table

I noticed something very peculiar when converting dates to character classes for large data sets. As an example, I have created a mock data set as follows:

library(data.table)
DT = data.table(x = rep("2007-1-1", 1e9), y = rep(1, 1e9))
DT[, x := as.Date(x)]  # convert the column by reference with :=

Now, I would like to convert the x column of dates from a date format to character.

DT[,x.character:= as.character(x)] 

This takes a bit of time for large data sets and I noticed that the time it takes to convert decreases dramatically if we did the following:

DT[,x.character:= as.character(x+y-y)]

All I did here was add y and subtract y, so I really am just getting the same results. From a logical standpoint, it seems like I am making the computer do more work. However, is there a reason why this method would result in a faster run than the straight conversion way?

For illustrative purposes, I timed these conversions with system.time() on 100,000 rows (1e5) and got these results:

DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))

system.time(DT[,x.character:= as.character(x)])
   user  system elapsed
  1.89    0.12    2.03

system.time(DT[,x.character:= as.character(x+y-y)])
   user  system elapsed
  0.635   0.008   0.643

system.time(DT[,x.character.sub:= as.character(x+y-y+y-y)])
   user  system elapsed
  0.347   0.004   0.351

As we can see, the second method results in less time needed, and more interestingly, the third method, with more of the y-y method, results in even less time. Is there a reason why?

Thank you!

user1398057 asked Dec 11 '22 03:12


1 Answer

It's faster the second (and any later) time you call as.character() during an R session because the resulting strings have already been added to R's global string cache. Adding and subtracting another variable is not relevant; only the order of the calls matters.

> library(data.table)
data.table 1.9.3  For help type: help("data.table")
> DT = data.table(x=rep(as.Date("2007-1-1"), 1e5), y = rep(1,1e5))
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.572   0.012   0.584 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.389   0.008   0.397 
> system.time(DT[,x.character := as.character(x)]) 
   user  system elapsed 
  0.332   0.004   0.337 

To further the point, this doesn't even have anything to do with data.table. From another new session:

> x <- rep(as.Date("2007-1-1"), 1e5)
> system.time(as.character(x)) 
   user  system elapsed 
  0.529   0.008   0.537 
> system.time(as.character(x)) 
   user  system elapsed 
  0.312   0.012   0.324 
> system.time(as.character(x)) 
   user  system elapsed 
  0.327   0.008   0.335 
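
If you want to check this yourself, here is a rough sketch rather than a rigorous benchmark. It times the same call several times in a row and confirms that the x+y-y version returns exactly the same strings. The helper name time_calls is made up for illustration, and the timings will vary by machine and R version. Run it in a fresh session to see the first-call effect.

```r
# Hypothetical helper (not part of base R or data.table):
# run f() n times and return the elapsed time of each run.
time_calls <- function(f, n = 3) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

x <- rep(as.Date("2007-1-1"), 1e5)
y <- rep(1, 1e5)

# Elapsed times for three identical calls; in a fresh session the
# first call is typically the slowest.
time_calls(function() as.character(x))

# Adding and subtracting y does not change the result.
identical(as.character(x), as.character(x + y - y))
```

To test the ordering claim directly, restart R and run the x + y - y version first: whichever expression runs first should be the one that pays the extra cost.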
Joshua Ulrich answered Apr 30 '23 23:04