This has been asked several times with no clear answer: I would like to convert an R character string of the form "YYYY-mm-dd" into a Date. The as.Date function is exceedingly slow. The question "convert character to date *quickly* in R" provides a solution using fasttime that works for dates from 1970 onward. My issue is that I have dates starting from 1900 that I need to convert, and there are about 100 million of them. I have to do this frequently, so speed is important. Are there any other solutions?
Consider the very fast anytime library, which does not have the pre-1970 limitation. It uses the Boost date_time C++ library and provides the functions anytime() and anydate() for conversions. Comparison:
require(anytime)        # anydate()
require(lubridate)      # parse_date_time()
require(microbenchmark) # microbenchmark()
set.seed(21)
test.dd <- as.Date("2018-05-16") - sample(40000, 1e6, TRUE) # 1 million random dates
microbenchmark(
  strptime(test.dd, "%Y-%m-%d"),                      # base strptime
  parse_date_time(test.dd, orders = "ymd"),           # lubridate (POSIXct class)
  as.Date(parse_date_time(test.dd, orders = "ymd")),  # lubridate + Date class conversion
  anydate(test.dd),                                   # anytime library
  times = 10L, unit = "s"
)
Result/Output (note that test.dd is already of class Date rather than character, so the microsecond anydate() timings likely reflect little or no actual parsing):
Unit: seconds
                                             expr          min           lq         mean       median           uq          max neval cld
                    strptime(test.dd, "%Y-%m-%d") 10.177406012 10.472527403 1.064532e+01 10.621221596 10.819156870 11.288330598    10   c
         parse_date_time(test.dd, orders = "ymd")  4.541542019  4.603663894 4.844961e+00  4.869800287  5.055844972  5.128409226    10   b
as.Date(parse_date_time(test.dd, orders = "ymd"))  4.461140695  4.568415584 4.867837e+00  4.739026273  5.080610126  5.532028490    10   b
                                 anydate(test.dd)  0.000000755  0.000004909 5.777500e-06  0.000005664  0.000006042  0.000012839    10   a
P.S. For working with time series, consider the flipTime library. It has all the required tools and is almost as fast as anytime for conversion purposes:
require(devtools)
install_github("Displayr/flipTime")
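I had a similar problem a while ago and came up with the following solution: (1) convert the strings to a factor, (2) convert the factor levels to Date, and (3) use the factor's integer codes to index the converted levels.
Extending Joshua Ulrich's example, I get (with slower timings on my laptop):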
library(date)
set.seed(21)
x <- as.character(Sys.Date() - sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
#  user  system elapsed
# 12.09    0.00   12.12
system.time(ddate <- as.Date(as.date(x, "ymd")))
#  user  system elapsed
#  6.97    0.04    7.05
system.time({
  xf <- as.factor(x)
  dDate <- as.Date(levels(xf))[as.integer(xf)]
})
#  user  system elapsed
#  1.16    0.00    1.15
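Here, step 2 (converting the levels with as.Date) does not depend on the length of x once x is large enough to contain every distinct date, and step 3 (indexing by the integer codes) scales extremely well because it is simple vector indexing. The bottleneck should be step 1 (the as.factor call), which can be avoided if the data is already stored as a factor.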
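The same idea can be written with unique() and match() instead of a factor, which skips the sorting of levels that as.factor() performs. This is only a minimal sketch, assuming every non-missing string is in "YYYY-mm-dd" form; fast_as_date is a hypothetical helper name, not a function from any package, and I have not timed it against the factor version:

```r
# Hypothetical helper: convert each distinct string once, then expand by position.
fast_as_date <- function(x) {
  u <- unique(x)                          # distinct date strings (NA kept as its own entry)
  d <- as.Date(u, format = "%Y-%m-%d")    # parse only the distinct values
  d[match(x, u)]                          # map every element of x to its parsed date
}

# Example usage with pre-1970 dates and a missing value
x <- c("1900-01-01", "2018-05-16", "1900-01-01", NA, "1969-12-31")
dates <- fast_as_date(x)
```

With about 100 million strings spanning 1900 to today there are only around 43,000 distinct dates, so as.Date() is called on a tiny vector and the remaining cost is the match() lookup.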