
convert string date to R Date FAST for all dates

Tags:

date

posix

r

This has been asked several times with no clear answer: I would like to convert an R character string of the form "YYYY-mm-dd" into a Date. The as.Date function is exceedingly slow. The question "convert character to date *quickly* in R" provides a solution using fasttime that works for dates from 1970 onward. My issue is that I have dates starting from 1900 that I need to convert, and there are about 100 million of them. I have to do this frequently, so speed is important. Are there any other solutions?

Alex asked Jan 08 '13 15:01



2 Answers

Consider the anytime package, which is very fast and has no problem with dates before 1970. It uses the Boost date_time C++ library and provides the functions anytime() and anydate() for conversions. Comparison:

library(anytime)        # anydate()
library(lubridate)      # parse_date_time()
library(microbenchmark) # microbenchmark()

set.seed(21)
# 1 million random dates; note that this is a Date vector, not character
test.dd <- as.Date("2018-05-16") - sample(40000, 1e6, TRUE)

microbenchmark(
    strptime(test.dd, "%Y-%m-%d"),                     # base strptime
    parse_date_time(test.dd, orders = "ymd"),          # lubridate (POSIXct class)
    as.Date(parse_date_time(test.dd, orders = "ymd")), # lubridate + Date class conversion
    anydate(test.dd),                                  # anytime package
    times = 10L, unit = "s"
)

Result/Output:

Unit: seconds
                                             expr          min           lq         mean       median           uq          max neval cld
                    strptime(test.dd, "%Y-%m-%d") 10.177406012 10.472527403 1.064532e+01 10.621221596 10.819156870 11.288330598    10   c
         parse_date_time(test.dd, orders = "ymd")  4.541542019  4.603663894 4.844961e+00  4.869800287  5.055844972  5.128409226    10  b 
as.Date(parse_date_time(test.dd, orders = "ymd"))  4.461140695  4.568415584 4.867837e+00  4.739026273  5.080610126  5.532028490    10  b 
                                 anydate(test.dd)  0.000000755  0.000004909 5.777500e-06  0.000005664  0.000006042  0.000012839    10 a 
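
One caveat when reading these timings: `test.dd` is built with `as.Date(...) - sample(...)`, so it is already a Date vector rather than character, and the microsecond `anydate()` figure most likely reflects handing back an existing Date rather than parsing a million strings. A minimal base R sketch for checking this and for building string input, assuming the goal is to time parsing from "YYYY-mm-dd" strings (`test.chr` is a name introduced here, not part of the original answer):

```r
set.seed(21)
# Smaller sample than the benchmark, just to inspect the types
test.dd <- as.Date("2018-05-16") - sample(40000, 1000, TRUE)

class(test.dd)               # "Date": nothing is left to parse

# format() on a Date yields "YYYY-mm-dd" strings by default
test.chr <- format(test.dd)
class(test.chr)              # "character"

# Round trip check: parsing the strings recovers the original dates
stopifnot(all(as.Date(test.chr) == test.dd))
```

Passing `test.chr` instead of `test.dd` to each expression in `microbenchmark()` would time the actual string parsing; the numbers would then differ from the table above.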

P.S. For working with time series, consider the flipTime library. It has all the required tools and is almost as fast as anytime for conversion purposes:

require(devtools)
install_github("Displayr/flipTime")
George Shimanovsky answered Oct 17 '22 18:10



I had a similar problem a while ago and came up with the following solution:

  1. Convert the string to a factor (if it is not already a factor).
  2. Convert the levels of the factor to a Date.
  3. Expand the converted levels to the full result using the index vector of the factor.

Extending Joshua Ulrich's example, I get (with slower timings on my laptop):

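library(date)   # provides as.date()
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
# Baseline: base as.Date() on the full character vector
system.time(dDate <- as.Date(x))
#    user  system elapsed 
#    12.09   0.00   12.12 
# as.date() from the date package, then conversion to Date
system.time(ddate <- as.Date(as.date(x,"ymd")))
#    user  system elapsed 
#    6.97    0.04    7.05 
# Factor approach: parse only the levels, then expand by integer index
system.time({
    xf <- as.factor(x)
    dDate <- as.Date(levels(xf))[as.integer(xf)]
})
#    user  system elapsed 
#    1.16    0.00    1.15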

Here, step 2 does not depend on the length of x once x is large enough, because the number of distinct dates is bounded (at most 40000 in this example), and step 3 scales extremely well (simple vector indexing). The bottleneck should be step 1, which can be avoided if the data is already stored as a factor.
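
The same idea can be wrapped in a small reusable function. The sketch below is a variant that uses `unique()` and `match()` instead of a factor; `fast_as_date` is a hypothetical name introduced here, and the default format assumes "YYYY-mm-dd" input as in the question:

```r
# Hypothetical helper (not from the original answer): parse each distinct
# string once, then expand the parsed values back to the full length.
fast_as_date <- function(x, format = "%Y-%m-%d") {
  x <- as.character(x)
  ux <- unique(x)                         # distinct strings only
  parsed <- as.Date(ux, format = format)  # the slow call, on few values
  parsed[match(x, ux)]                    # cheap integer indexing
}

set.seed(21)
x <- as.character(as.Date("2018-05-16") - sample(40000, 1e5, TRUE))
d <- fast_as_date(x)

# The result agrees with a direct as.Date() call on the full vector
stopifnot(all(d == as.Date(x)))
```

Because only the distinct strings are parsed, the gain depends on how many repeats the data contains; if nearly every string is unique, this should perform about the same as calling as.Date() directly.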

Jonas Rauch answered Oct 17 '22 18:10
