I wrote this function which I use all the time:
# Give the previous day, or Friday if the previous day is Saturday or Sunday.
previous_business_date_if_weekend = function(my_date) {
  if (length(my_date) == 1) {
    if (weekdays(my_date) == "Sunday") { my_date = lubridate::as_date(my_date) - 2 }
    if (weekdays(my_date) == "Saturday") { my_date = lubridate::as_date(my_date) - 1 }
    return(lubridate::as_date(my_date))
  } else if (length(my_date) > 1) {
    my_date = lubridate::as_date(sapply(my_date, previous_business_date_if_weekend))
    return(my_date)
  }
}
Problems arise when I apply it to a date column of a dataframe with thousands of rows. It's ridiculously slow. Any thoughts as to why?
"Lubridate has an inbuilt very fast POSIX parser, ported from the fasttime package by Simon Urbanek. This functionality is as yet optional and could be activated with options(lubridate.fasttime = TRUE). Lubridate will automatically detect POSIX strings and use fast parser instead of the default strptime utility."
Lubridate is an R package that makes it easier to work with dates and times. Below is a concise tour of some of the things lubridate can do for you. Lubridate was created by Garrett Grolemund and Hadley Wickham, and is now maintained by Vitalie Spinu.
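Coming back to the function in the question: its sapply() branch calls the function once per element, so every row pays for its own weekdays() and as_date() calls. A minimal vectorized sketch in base R, assuming the input is a Date vector (or something as.Date() can coerce); the name previous_business_date_vec is made up for illustration and is not from the original post:

```r
# Hypothetical vectorized variant of the function in the question
previous_business_date_vec <- function(my_date) {
  d <- as.Date(my_date)
  # POSIXlt weekday numbers do not depend on the locale:
  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  wd <- as.POSIXlt(d)$wday
  # Days to go back: 2 for Sunday, 1 for Saturday, 0 otherwise
  offset <- ifelse(wd == 0L, 2L, ifelse(wd == 6L, 1L, 0L))
  d - offset
}

# 2024-06-08 is a Saturday, 2024-06-09 a Sunday, 2024-06-10 a Monday
previous_business_date_vec(as.Date(c("2024-06-08", "2024-06-09", "2024-06-10")))
```

The whole column is handled in a handful of vectorized calls instead of one function call per row.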
Lubridate is just kind of slow in my experience. I suggest working with data.table and IDate.
Something like this should be pretty robust:
library(data.table)
# Make data.table of dates in string format
x = data.table(date = format(Sys.Date() + 0:100000, format = "%d/%m/%Y"))
# Convert to IDate (by reference)
set(x, j = "date", value = as.IDate(strptime(x[, date], "%d/%m/%Y")))
# Day zero was a Thursday
originDate = as.IDate(strptime("01/01/1970", "%d/%m/%Y"))
as.integer(originDate)
#[1] 0
weekdays(originDate)
#[1] "Thursday"
previous_business_date_if_weekend_dt = function(x) {
  # Adjust dates so that Sat is 1, Sun is 2, and subtract by reference
  x[, adjustedDate := date]
  x[(as.integer(x[, date] - 2) %% 7 + 1) <= 2, adjustedDate := adjustedDate - (as.integer(date - 2) %% 7 + 1)]
}
bizdays <- function(x) x - match(weekdays(x), c("Saturday", "Sunday"), nomatch = 0)
# Note: y is not defined anywhere in the post; presumably it holds the same dates as a Date vector
system.time(bizdays(y))
#    user  system elapsed
#    0.22    0.00    0.22
system.time(previous_business_date_if_weekend_dt(x))
#    user  system elapsed
#       0       0       0
Also note that the part that takes the most time in this solution is probably parsing the dates from strings. If you're concerned about that, you could store them in an integer format instead.
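The modulo expression in previous_business_date_if_weekend_dt relies on day zero (1970-01-01) being a Thursday. A small base R sketch that checks the mapping for one full week; the variable names are made up for illustration, and plain Date is used instead of IDate so that it runs without data.table:

```r
# One full week starting at day zero (1970-01-01, a Thursday)
d <- as.Date("1970-01-01") + 0:6

# Same arithmetic as in the function above:
# Saturday -> 1, Sunday -> 2, Monday..Friday -> 3..7
code <- (as.integer(d) - 2L) %% 7L + 1L

# Locale-independent weekday number for comparison (0 = Sunday, 6 = Saturday)
wday <- as.POSIXlt(d)$wday
print(data.frame(date = d, wday = wday, code = code))

# Subtract the code only where it is 1 or 2, i.e. on weekends
adjusted <- d - ifelse(code <= 2L, code, 0L)
print(adjusted)
```

Because the code for Saturday is 1 and for Sunday is 2, the code itself is the number of days to go back to Friday.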
OP's question "Why are my functions on lubridate dates so slow?" and some generalizing statements like "Lubridate is just kind of slow in my experience" suggest that a particular package might be the cause of the low performance.
I want to verify this with some benchmarks.
Frank mentioned in his comment that there is a penalty in using the double colon operator :: to access exported variables or functions in a namespace.
# creating data
n <- 10^1L
fmt <- "%F"
chr_dates <- format(Sys.Date() + seq_len(n), "%F")
# loading lubridate into namespace
library(lubridate)
microbenchmark::microbenchmark(
  base1 = r1 <- as.Date(chr_dates),
  base2 = r2 <- base::as.Date(chr_dates),
  lubr1 = r3 <- as_date(chr_dates),
  lubr2 = r4 <- lubridate::as_date(chr_dates),
  times = 100L
)
Unit: microseconds
  expr     min       lq      mean  median       uq     max neval cld
 base1  87.977  89.1100  92.03587  89.865  90.9980 128.756   100 a
 base2  94.018  95.7175 100.64848  97.039  99.3045 179.351   100 b
 lubr1  92.508  94.2070  98.21307  95.151  97.7940 175.954   100 b
 lubr2 101.569 103.0800 109.98974 104.024 107.9885 258.643   100 c
The penalty for using the double colon operator :: is about 10 microseconds.
This only matters if a function is called repeatedly (as it happens in OP's code using sapply()). IMHO, the pain of debugging namespace conflicts or maintaining code where the origin of functions is unclear is much higher. Your mileage may vary, of course.
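If the per-call lookup does matter inside a loop, one option is to resolve :: once and reuse a local binding. A minimal sketch, using base::as.Date as a stand-in so it runs without extra packages; the alias to_date and the sample data are assumptions for illustration:

```r
# Ten sample dates as character strings
chr_dates <- format(as.Date("2024-01-01") + 0:9)

# `::` is evaluated again on every iteration
r1 <- vapply(chr_dates, function(s) as.numeric(base::as.Date(s)), numeric(1))

# Resolve the function once, then call the local alias in the loop
to_date <- base::as.Date  # hypothetical alias, looked up a single time
r2 <- vapply(chr_dates, function(s) as.numeric(to_date(s)), numeric(1))

# Both variants give the same result
print(identical(r1, r2))
```

This keeps the origin of the function explicit at the point where the alias is created while avoiding the repeated lookup.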
The timings can be verified for n = 100:
Unit: microseconds
  expr     min       lq     mean   median       uq      max neval cld
 base1 556.933 561.0855 580.3382 562.9730 590.7250  812.176   100 a
 base2 564.483 568.2600 588.5695 570.9030 596.2010  989.262   100 a
 lubr1 562.596 565.9935 587.4443 568.4480 594.8790 1039.480   100 a
 lubr2 572.036 575.9995 597.1557 578.4545 601.1085 1230.159   100 a
There are a number of packages which deal with the conversion of character dates given in different formats to class Date or POSIXct. Some of them aim at performance, others at convenience.
Here, base, lubridate, anytime, fasttime, and data.table (because it was mentioned in one of the answers) are compared.
Input are character dates in the standard unambiguous format YYYY-MM-DD. Time zones are ignored.
fasttime accepts only dates between 1970 and 2199, so the creation of sample data had to be modified in order to create a sample data set of 100 K dates.
n <- 10^5L
fmt <- "%F"
set.seed(123L)
chr_dates <- format(
  sample(
    seq(as.Date("1970-01-01"), as.Date("2199-12-31"), by = 1L),
    n, replace = TRUE),
  "%F")
Because Frank had suspected that guessing formats could add a penalty, the functions are called with and without given format where possible. All functions are called using the double colon operator ::.
microbenchmark::microbenchmark(
  base_ = r1 <- base::as.Date(chr_dates),
  basef = r1 <- base::as.Date(chr_dates, fmt),
  lub1_ = r2 <- lubridate::as_date(chr_dates),
  lub1f = r2 <- lubridate::as_date(chr_dates, fmt),
  lub2_ = r3 <- lubridate::ymd(chr_dates),
  anyt_ = r4 <- anytime::anydate(chr_dates),
  idat_ = r5 <- data.table::as.IDate(chr_dates),
  idatf = r5 <- data.table::as.IDate(chr_dates, fmt),
  fast_ = r6 <- fasttime::fastPOSIXct(chr_dates),
  fastd = r6 <- as.Date(fasttime::fastPOSIXct(chr_dates)),
  times = 5L
)
# check results
all.equal(r1, r2)
all.equal(r1, r3)
all.equal(r1, c(r4)) # remove tzone attribute
all.equal(r1, as.Date(r5)) # convert IDate to Date
all.equal(r1, as.Date(r6)) # convert POSIXct to Date
Unit: milliseconds
  expr        min         lq       mean     median         uq        max neval cld
 base_ 641.799082 645.008517 648.128466 648.791875 649.149444 655.893411     5 d
 basef  69.377419  69.937371  73.888828  71.403139  76.022083  82.704127     5 b
 lub1_ 644.199361 645.217696 680.542327 649.855896 652.887492 810.551189     5 d
 lub1f  69.769726  69.947943  70.944605  70.795234  71.365759  72.844364     5 b
 lub2_  18.672495  27.025711  26.990218  28.180730  29.944409  31.127747     5 ab
 anyt_ 381.870316 384.513758 386.211134 384.992152 385.159043 394.520400     5 c
 idat_ 643.386808 644.312259 649.385356 648.204359 651.666396 659.356958     5 d
 idatf  69.844109  71.188673  75.319481  77.142365  78.156923  80.265334     5 b
 fast_   4.994637   5.363533   5.748137   5.601031   5.760370   7.021112     5 a
 fastd   5.230625   6.296157   6.686500   6.345998   6.538941   9.020780     5 a
The timings show that passing the format to as.Date(), as_date(), and as.IDate() is about ten times faster than calling them without it, which supports Frank's suspicion that guessing the format adds a penalty.
fasttime::fastPOSIXct() is the fastest, indeed. Even with the additional conversion from POSIXct to Date it is four times faster than the second fastest, lubridate::ymd().
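The effect of supplying the format can be tried on a small scale with base R alone. A minimal sketch, assuming ISO dates (YYYY-MM-DD) and a made-up sample of 1,000 strings; absolute timings depend on machine and R version, so none are shown here:

```r
# Made-up sample: 1,000 consecutive dates as ISO strings
chr_dates <- format(as.Date("2000-01-01") + 0:999, "%Y-%m-%d")

# Without a format, as.Date() has to work out the format itself
guessed <- as.Date(chr_dates)

# With an explicit format, that step is skipped
explicit <- as.Date(chr_dates, format = "%Y-%m-%d")

# Both calls should yield the same dates
print(identical(guessed, explicit))

# Rough timing comparison; repeated to get measurable numbers
print(system.time(for (i in 1:20) as.Date(chr_dates)))
print(system.time(for (i in 1:20) as.Date(chr_dates, format = "%Y-%m-%d")))
```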