Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is as.Date slow on a character vector?

Tags:

r

data.table

I started using data.table package in R to boost performance of my code. I am using the following code:

sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]

I noticed that the conversion done by as.Date function is very slow, when compared to other functions that create weekdays, etc. Why is that? Is there a better/faster solution, how to convert into date-format? (If you would ask whether I really need the date format, probably yes, because then use ggplot2 to make plots, which work like a charm with this type of data.)

To be more precise

> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
   user  system elapsed 
 92.603   0.289  93.014 
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
   user  system elapsed 
  1.938   0.062   2.001 
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
   user  system elapsed 
  0.304   0.001   0.305 

On MacAir i5 with slightly less then 3000000 observations.

like image 519
krhlk Avatar asked Oct 08 '12 17:10

krhlk


4 Answers

As others mentioned, strptime (converting from character to POSIXlt) is the bottleneck here. Another simple solution uses the lubridate package and its fast_strptime method instead.

Here's what it looks like on my data:

> tables()
     NAME      NROW  MB COLS                                     
[1,] pp   3,718,339 126 session_id,date,user_id,path,num_sessions
     KEY         
[1,] user_id,date
Total: 126MB

> pp[, 2]
               date
      1: 2013-09-25
      2: 2013-09-25
      3: 2013-09-25
      4: 2013-09-25
      5: 2013-09-25
     ---           
3718335: 2013-09-25
3718336: 2013-09-25
3718337: 2013-09-25
3718338: 2013-10-11
3718339: 2013-10-11

> system.time(pp[, date := as.Date(fast_strptime(date, "%Y-%m-%d"))])
   user  system elapsed 
  0.315   0.026   0.344  

For comparison:

> system.time(pp[, date := as.Date(date, "%Y-%m-%d")])
   user  system elapsed 
108.193   0.399 108.844 

That's ~316 times faster!

like image 84
daniel.s Avatar answered Nov 08 '22 01:11

daniel.s


I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
[5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct  
[9] as.Date.POSIXlt  
   Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...) 
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx)) 
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", 
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d", 
            tz = "GMT"))) 
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format)) 
        charToDate(x)
    else strptime(x, format, tz = "GMT")       ####  slow part, I think  ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
> 

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through :

> as.POSIXct
function (x, tz = "", ...) 
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
[5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
   Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...) 
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
> 

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep main/src to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Oh, we could have guessed datetime.c. Anyway, so looking at latest live copy :

datetime.c

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus seven integer vectors (4 bytes each). That's 40 bytes, per date!

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.


Here's a reproducible example using the number of items stated in question (3,000,000) :

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
   user  system elapsed 
 21.681   0.060  21.760 
> system.time(strptime(Date, "%m/%d/%Y"))
   user  system elapsed 
 29.594   8.633  38.270 
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
   user  system elapsed 
 19.785   0.000  19.802 

Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?

like image 37
Matt Dowle Avatar answered Nov 08 '22 02:11

Matt Dowle


Thanks for the suggestions. I solved it by writing the Gaussian algorithm for the dates myself and got far better results, see below.

getWeekDay <- function(year, month, day) {
  # Implementation of the Gaussian algorithm to get weekday 0 - Sunday, ... , 7 - Saturday
  Y <- year
  Y[month<3] <- (Y[month<3] - 1)

  d <- day
  m <- ((month + 9)%%12) + 1
  c <- floor(Y/100)
  y <- Y-c*100
  dayofweek <- (d + floor(2.6*m - 0.2) + y + floor(y/4) + floor(c/4) - 2*c) %% 7
  return(dayofweek)
}

sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")

# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Month:=as.integer(substr(Date,1,2))]
sp500 <- sp500[,Day:=as.integer(substr(Date,4,5))]
sp500 <- sp500[,Year:=as.integer(substr(Date,7,10))]
#sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
#sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Weekday:=factor(getWeekDay(Year, Month, Day))]
levels(sp500$Weekday) <- days

Running the whole block above gives (including reading the date from csv)... Data.table is truly impressive.

user  system elapsed 
 19.074   0.803  20.284 

Timing of the conversion itself is 3.49 elapsed.

like image 8
krhlk Avatar answered Nov 08 '22 02:11

krhlk


This is an old question, but I think this tiny trick it might be useful. If you have multiple rows with the same date, you can do

data[, date := as.Date(date[1]), by = date]

It's much faster since it only processes each date once (in my dataset of 40 million rows it goes from 25 seconds to 0.5 seconds).

like image 8
Elio Campitelli Avatar answered Nov 08 '22 03:11

Elio Campitelli