Why are my functions on lubridate dates so slow?

Tags: date, r, lubridate

I wrote this function which I use all the time:

# Give the previous day, or Friday if the previous day is Saturday or Sunday.
previous_business_date_if_weekend = function(my_date) {
    if (length(my_date) == 1) {
        if (weekdays(my_date) == "Sunday") { my_date = lubridate::as_date(my_date) - 2 }
        if (weekdays(my_date) == "Saturday") { my_date = lubridate::as_date(my_date) - 1 }
        return(lubridate::as_date(my_date))
    } else if (length(my_date) > 1) {
        my_date = lubridate::as_date(sapply(my_date, previous_business_date_if_weekend))
        return(my_date)
    }
}

Problems arise when I apply it to a date column of a dataframe with thousands of rows. It's ridiculously slow. Any thoughts as to why?
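For scale, a quick illustrative timing on 10,000 synthetic dates (sketch only; exact numbers will vary by machine):

dates <- Sys.Date() + 0:9999

# One function call per element: each call re-resolves lubridate::as_date
# and runs weekdays() on a length-1 Date
system.time(previous_business_date_if_weekend(dates))

# A single vectorized weekdays() call over the whole vector, for contrast
system.time(weekdays(dates))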

asked Sep 05 '17 by lebelinoz

People also ask

Is Lubridate slow?

" Lubridate has an inbuilt very fast POSIX parser, ported from the fasttime package by Simon Urbanek. This functionality is as yet optional and could be activated with options(lubridate. fasttime = TRUE). Lubridate will automatically detect POSIX strings and use fast parser instead of the default strptime utility. "

What does the Lubridate package do?

Lubridate is an R package that makes it easier to work with dates and times. It was created by Garrett Grolemund and Hadley Wickham, and is now maintained by Vitalie Spinu.


2 Answers

Lubridate is just kind of slow in my experience. I suggest working with data.table and IDate.

Something like this should be pretty robust:

library(data.table)

#Make data.table of dates in string format
x = data.table(date = format(Sys.Date() + 0:100000, format = '%d/%m/%Y'))

#Convert to IDate (by reference)
set(x, j = "date", value = as.IDate(strptime(x[,date], "%d/%m/%Y")))

#Day zero was a Thursday
originDate = as.IDate(strptime("01/01/1970", "%d/%m/%Y"))
as.integer(originDate)
#[1] 0
weekdays(originDate)
#[1] "Thursday"

previous_business_date_if_weekend_dt = function(x) {

  #Day 0 (1970-01-01) was a Thursday, so (date - 2) %% 7 + 1 maps Sat to 1
  #and Sun to 2; subtract that many days, by reference, to land on Friday
  x[, adjustedDate := date]
  x[(as.integer(date - 2) %% 7 + 1) <= 2,
    adjustedDate := adjustedDate - (as.integer(date - 2) %% 7 + 1)]

}

#For comparison, a vectorized base-R one-liner: match() gives 1 for
#Saturday, 2 for Sunday, and 0 (nomatch) otherwise - exactly the number
#of days to subtract
bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)

#Plain Date vector of the same dates for the base-R timing
y = Sys.Date() + 0:100000

system.time(bizdays(y))
# user  system elapsed 
# 0.22    0.00    0.22 

system.time(previous_business_date_if_weekend_dt(x))
# user  system elapsed 
# 0       0       0 

Also note that the most expensive part of this solution is probably parsing the dates from strings; if that's a concern, you could store them in an integer format instead.
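For instance, a sketch of the integer route (assuming the dates can be shipped as day counts since 1970-01-01; IDate is already integer-backed):

library(data.table)

#Store the dates as plain integer day counts instead of strings
x = data.table(date_int = as.integer(as.IDate(Sys.Date() + 0:100000)))

#Recover the IDate column with no string parsing involved
x[, date := as.IDate(date_int, origin = "1970-01-01")]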

answered Sep 27 '22 by Matt

OP's question "Why are my functions on lubridate dates so slow?" and generalizing statements like "Lubridate is just kind of slow in my experience" suggest that a particular package might be the cause of low performance.

I want to verify this with some benchmarks.

Penalty of using the double colon operator ::

Frank mentioned in his comment that there is a penalty in using the double colon operator :: to access exported variables or functions in a namespace.

# creating data: n character dates in ISO format
n <- 10L
fmt <- "%F"
chr_dates <- format(Sys.Date() + seq_len(n), fmt)
    
# attaching lubridate so that as_date() works without the :: prefix
library(lubridate) 
microbenchmark::microbenchmark(
  base1 = r1 <- as.Date(chr_dates),
  base2 = r2 <- base::as.Date(chr_dates),
  lubr1 = r3 <- as_date(chr_dates),
  lubr2 = r4 <- lubridate::as_date(chr_dates),
  times = 100L
)
Unit: microseconds
  expr     min       lq      mean  median       uq     max neval cld
 base1  87.977  89.1100  92.03587  89.865  90.9980 128.756   100 a  
 base2  94.018  95.7175 100.64848  97.039  99.3045 179.351   100  b 
 lubr1  92.508  94.2070  98.21307  95.151  97.7940 175.954   100  b 
 lubr2 101.569 103.0800 109.98974 104.024 107.9885 258.643   100   c

The penalty for using the double colon operator :: is about 10 microseconds per call.

This only matters if a function is called repeatedly, as happens in OP's code using sapply(); see the sketch below for a cheap workaround. IMHO, the pain of debugging namespace conflicts or maintaining code where the origin of functions is unclear is much higher. Your mileage may vary, of course.
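A simple mitigation is to resolve the namespaced function once and reuse the binding; a sketch (as_date_cached is an illustrative name):

# one-time namespace lookup instead of one lookup per call
as_date_cached <- lubridate::as_date

microbenchmark::microbenchmark(
  direct = lubridate::as_date(chr_dates),
  cached = as_date_cached(chr_dates),
  times = 100L
)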

The timings for n = 100 confirm this; there, the per-call difference already disappears in the noise:

Unit: microseconds
  expr     min       lq     mean   median       uq      max neval cld
 base1 556.933 561.0855 580.3382 562.9730 590.7250  812.176   100   a
 base2 564.483 568.2600 588.5695 570.9030 596.2010  989.262   100   a
 lubr1 562.596 565.9935 587.4443 568.4480 594.8790 1039.480   100   a
 lubr2 572.036 575.9995 597.1557 578.4545 601.1085 1230.159   100   a

Converting character dates to class Date

A number of packages deal with converting character dates in different formats to class Date or POSIXct. Some aim at performance, others at convenience.

Here, base, lubridate, anytime, fasttime, and data.table (because it was mentioned in one of the answers) are compared.

The input is character dates in the standard unambiguous format YYYY-MM-DD. Time zones are ignored.

fasttime accepts only dates between 1970 and 2199, so the sample data set of 100 K dates is drawn from that range.

n <- 10^5L
fmt <- "%F"
set.seed(123L)
chr_dates <- format(
  sample(
    seq(as.Date("1970-01-01"), as.Date("2199-12-31"), by = 1L), 
    n, replace = TRUE),
  "%F")

Because Frank had suspected that guessing formats could add a penalty, the functions are called with and without an explicit format where possible. All functions are called using the double colon operator ::.

microbenchmark::microbenchmark(
  base_ = r1 <- base::as.Date(chr_dates),
  basef = r1 <- base::as.Date(chr_dates, fmt),
  lub1_ = r2 <- lubridate::as_date(chr_dates),
  lub1f = r2 <- lubridate::as_date(chr_dates, fmt),
  lub2_ = r3 <- lubridate::ymd(chr_dates),
  anyt_ = r4 <- anytime::anydate(chr_dates),
  idat_ = r5 <- data.table::as.IDate(chr_dates),
  idatf = r5 <- data.table::as.IDate(chr_dates, fmt),
  fast_ = r6 <- fasttime::fastPOSIXct(chr_dates),
  fastd = r6 <- as.Date(fasttime::fastPOSIXct(chr_dates)),
  times = 5L
)
# check results
all.equal(r1, r2)
all.equal(r1, r3)
all.equal(r1, c(r4)) # remove tzone attribute
all.equal(r1, as.Date(r5)) # convert IDate to Date
all.equal(r1, as.Date(r6)) # convert POSIXct to Date
Unit: milliseconds
  expr        min         lq       mean     median         uq        max neval  cld
 base_ 641.799082 645.008517 648.128466 648.791875 649.149444 655.893411     5    d
 basef  69.377419  69.937371  73.888828  71.403139  76.022083  82.704127     5  b  
 lub1_ 644.199361 645.217696 680.542327 649.855896 652.887492 810.551189     5    d
 lub1f  69.769726  69.947943  70.944605  70.795234  71.365759  72.844364     5  b  
 lub2_  18.672495  27.025711  26.990218  28.180730  29.944409  31.127747     5 ab  
 anyt_ 381.870316 384.513758 386.211134 384.992152 385.159043 394.520400     5   c 
 idat_ 643.386808 644.312259 649.385356 648.204359 651.666396 659.356958     5    d
 idatf  69.844109  71.188673  75.319481  77.142365  78.156923  80.265334     5  b  
 fast_   4.994637   5.363533   5.748137   5.601031   5.760370   7.021112     5 a   
 fastd   5.230625   6.296157   6.686500   6.345998   6.538941   9.020780     5 a

The timings show that

  • Frank's suspicion is correct: guessing formats is costly. Passing the format as a parameter to as.Date(), as_date(), and as.IDate() is about ten times faster than calling them without one.
  • fasttime::fastPOSIXct() is indeed the fastest. Even with the additional conversion from POSIXct to Date, it is about four times faster than the second fastest, lubridate::ymd(). Both findings are combined in the sketch below.
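A minimal sketch of the two fastest routes for ISO-formatted character dates, reusing chr_dates and fmt from above:

# explicit format: about ten times faster than letting the format be guessed
d1 <- base::as.Date(chr_dates, fmt)

# fasttime: fastest overall, but limited to dates between 1970 and 2199
d2 <- as.Date(fasttime::fastPOSIXct(chr_dates))

all.equal(d1, d2)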
answered Sep 27 '22 by Uwe