R ~ Vectorization of a user defined function

I need to write a function that will count the number of working days (minus weekends, and a vector of other local bank holidays), but the problem I'm coming up against is more simply illustrated with just counting the number of weekdays.

Here is a function that will give the number of weekdays between two dates:

removeWeekends <- function(end, start){

  range <- as.Date(start:end, "1970-01-01")

  range<- range[sapply(range, function(x){
                                if(!chron::is.weekend(x)){
                                  return(TRUE)
                                }else{
                                  return(FALSE)
                                }
                              })]

  return(NROW(range))

}

Which works when it is given a single date for each argument:

removeWeekends(as.Date("2018-05-08"), as.Date("2018-06-08"))
#[1] 24

But when it is given two vectors from a data frame, it fails:

one <- as.Date("2017-01-01"):as.Date("2017-01-08")
two <- as.Date("2018-06-08"):as.Date("2018-06-15")
df <- data.frame(one, two)
removeWeekends(df$two, df$one)
#[1] 375
#Warning messages:
#1: In start:end : numerical expression has 8 elements: only the first used
#2: In start:end : numerical expression has 8 elements: only the first used

I've also tried (which I guessed would not work as the syntax seems off):

lapply(df, removeWeekends, df$two, df$one)
#Error in FUN(X[[i]], ...) : unused argument (17167:17174)

And:

lapply(df[,c("two", "one")], removeWeekends)
#Error in as.Date(start:end, "1970-01-01") :   argument "start" is missing,
# with no default 

I'm assuming it is me misunderstanding the concept of vectorization.

The only other idea I've got is nesting the function within a conditional that checks whether the input is a vector and, if it is, calls an apply function on it, although I'm not quite sure how I would structure that either.

Nick asked Jun 08 '18 18:06


2 Answers

You have a couple of options for supporting vectorized arguments in a function. Since you have already written your function, the easiest option is to use Vectorize to convert it to support vectorized arguments. Another option is to rewrite your function so that it handles vectorized arguments itself.

Option#1: Using Vectorize

# Function will support vectorized argument with single statement
vremoveWeekends  <- Vectorize(removeWeekends)

# Try vremoveWeekends  function 
df$dayswithoutweekends <- vremoveWeekends(df$two, df$one)
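
For intuition, Vectorize is roughly a wrapper around mapply. The sketch below shows that idea in base R only. The names count_weekdays_one and count_weekdays are illustrative, and as.POSIXlt()$wday is used as an assumed stand-in for chron::is.weekend so that the block runs without extra packages.

```r
# Hypothetical single-pair helper, standing in for removeWeekends().
# as.POSIXlt()$wday (0 = Sunday, 6 = Saturday) replaces chron::is.weekend.
count_weekdays_one <- function(end, start) {
  days <- seq(start, end, by = "day")
  sum(!(as.POSIXlt(days)$wday %in% c(0L, 6L)))
}

# Roughly what Vectorize() sets up: mapply() calls the helper once per
# pair of elements taken from `end` and `start`.
count_weekdays <- function(end, start) {
  mapply(count_weekdays_one, end, start)
}

one <- seq(as.Date("2017-01-01"), as.Date("2017-01-08"), by = "day")
two <- seq(as.Date("2018-06-08"), as.Date("2018-06-15"), by = "day")
res <- count_weekdays(two, one)
```

This still loops over the pairs internally, so it is a convenience rather than a speed-up.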

Option#2: Rewrite the function to support vectorized arguments. I prefer this option, since the OP has two arguments that are expected to be of the same length, and it is easier to perform error checking on the arguments if we rewrite the function.

# Modified function
removeWeekendsNew <- function(end, start){
  # Both arguments must have the same length; fail loudly otherwise
  if(length(start) != length(end)){
    stop("'start' and 'end' must have the same length")
  }
  result <- integer(length(start)) # one count per pair of dates

  # mapply would also work here in place of the for-loop
  for(i in seq_along(start)){
    range     <- seq(start[i], end[i], by="day")
    result[i] <- sum(!chron::is.weekend(range)) # weekdays in this range
  }

  return(result)
}

#Use new function:
df$dayswithoutweekends <- removeWeekendsNew(df$two, df$one)

Result: It is the same for both options mentioned above.

df
#          one        two dayswithoutweekends
# 1 2017-01-01 2018-06-08                 375
# 2 2017-01-02 2018-06-09                 375
# 3 2017-01-03 2018-06-10                 374
# 4 2017-01-04 2018-06-11                 374
# 5 2017-01-05 2018-06-12                 374
# 6 2017-01-06 2018-06-13                 374
# 7 2017-01-07 2018-06-14                 374
# 8 2017-01-08 2018-06-15                 375
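
The question also mentions excluding a vector of local bank holidays. A minimal sketch of how the loop from Option#2 might be extended for that is shown below. The function name count_working_days, the holidays argument and the holiday dates are all made up for illustration, and as.POSIXlt()$wday is used as an assumed stand-in for chron::is.weekend.

```r
# Sketch: count working days, excluding weekends and a vector of holidays.
# `holidays` is assumed to be a Date vector; names here are illustrative.
count_working_days <- function(end, start, holidays = as.Date(character(0))) {
  stopifnot(length(start) == length(end))
  result <- integer(length(start))
  for (i in seq_along(start)) {
    days <- seq(start[i], end[i], by = "day")
    is_weekend <- as.POSIXlt(days)$wday %in% c(0L, 6L)  # 0 = Sunday, 6 = Saturday
    # A holiday that falls on a weekend is only excluded once
    result[i] <- sum(!is_weekend & !(days %in% holidays))
  }
  result
}

# Made-up holiday list: one Monday and one Saturday
holidays <- as.Date(c("2018-05-28", "2018-06-02"))
count_working_days(as.Date("2018-06-08"), as.Date("2018-05-08"), holidays)
# [1] 23
```

Without the holidays argument, the same call gives 24, matching the single-date example in the question.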

Data:

one <- seq(as.Date("2017-01-01"),as.Date("2017-01-08"), by="day")
two <- seq(as.Date("2018-06-08"),as.Date("2018-06-15"), by="day")
df <- data.frame(one, two)
df
#          one        two
# 1 2017-01-01 2018-06-08
# 2 2017-01-02 2018-06-09
# 3 2017-01-03 2018-06-10
# 4 2017-01-04 2018-06-11
# 5 2017-01-05 2018-06-12
# 6 2017-01-06 2018-06-13
# 7 2017-01-07 2018-06-14
# 8 2017-01-08 2018-06-15

MKR answered Oct 11 '22 08:10


If you want to fully vectorize this, you will need to think outside the box. All chron::is.weekend does is check which days in a given time period are Saturdays or Sundays. We can calculate this count ourselves in a vectorized way, because each full week contains exactly two weekend days, and the only tricky part is the leftover days.
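
As a worked illustration of that idea for a single pair of dates (a sketch only, using base R's as.POSIXlt()$wday as an assumed way to detect weekend days), the count splits into full weeks plus a short tail:

```r
start <- as.Date("2017-01-01")  # a Sunday
end   <- as.Date("2018-06-08")  # a Friday

n_days     <- as.integer(end - start) + 1L  # days in the range, both ends included
full_weeks <- n_days %/% 7L                 # each full week has exactly 5 weekdays
leftover   <- n_days %% 7L                  # days that do not fill a whole week

# Take the leftover days from the end of the range and inspect only those
tail_days     <- end - (seq_len(leftover) - 1L)
tail_weekdays <- sum(!(as.POSIXlt(tail_days)$wday %in% c(0L, 6L)))  # 0 = Sunday, 6 = Saturday

total <- full_weeks * 5L + tail_weekdays
total
# [1] 375
```

Here the range holds 524 days, that is 74 full weeks and 6 leftover days, of which 5 are weekdays.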

I wrote the following function to achieve this, though I'm sure it could be improved:

frw <- function(two, one) {

  diff_d <- two - one ## difference in days
  l_d <- (two + 4L) %% 7L + 1L ## weekday of the end date (1 = Sunday, ..., 7 = Saturday)
  weeks <- diff_d %/% 7L ## number of weeks between
  days <- diff_d %% 7L ## days left

  ## calculate how many work days left
  diff_d - 
    ((weeks * 2L) + ((l_d - days < 1) + ((l_d - days < 2) - (l_d == 1L))) +
    (l_d %in% c(1L, 7L))) + 1L

}

You can run it as follows:

frw(two, one)
## [1] 375 375 374 374 374 374 374 375
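
One way to gain confidence in frw is to compare it with a brute-force count on random inputs. The sketch below does that; brute and frw_dates are hypothetical helper names, and the weekend test via as.POSIXlt()$wday is an assumption standing in for chron::is.weekend. Note that frw relies on %%, which is not defined for Date objects, so it expects integer day numbers (as produced by the `:` calls in the question), hence the small wrapper for Date input.

```r
# frw() copied from the answer above so this block runs on its own
frw <- function(two, one) {
  diff_d <- two - one
  l_d <- (two + 4L) %% 7L + 1L
  weeks <- diff_d %/% 7L
  days <- diff_d %% 7L
  diff_d -
    ((weeks * 2L) + ((l_d - days < 1) + ((l_d - days < 2) - (l_d == 1L))) +
    (l_d %in% c(1L, 7L))) + 1L
}

# Brute-force reference for one pair of integer day numbers
# (days since 1970-01-01); for checking only.
brute <- function(end, start) {
  days <- as.Date(start:end, origin = "1970-01-01")
  sum(!(as.POSIXlt(days)$wday %in% c(0L, 6L)))  # 0 = Sunday, 6 = Saturday
}

# Illustrative wrapper: convert Date input to integer day numbers first
frw_dates <- function(end, start) frw(as.integer(end), as.integer(start))

# Random start dates in 2017 and ranges of 0 to 400 days
set.seed(1)
start <- as.integer(as.Date("2017-01-01")) + sample(0:365, 200, replace = TRUE)
end   <- start + sample(0:400, 200, replace = TRUE)

expected <- mapply(brute, end, start)
stopifnot(all(frw(end, start) == expected))
stopifnot(all(frw_dates(as.Date(end, origin = "1970-01-01"),
                        as.Date(start, origin = "1970-01-01")) == expected))
```

The check only covers ranges where the end date is not before the start date.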

It is far faster than the mapply version (almost instant). Here is a benchmark on a bigger data set:

one <- as.Date("2017-01-01"):as.Date("2030-01-08")
two <- as.Date("2017-05-01"):as.Date("2030-05-08")
df <- data.frame(one, two)

system.time(res_mapply <- vremoveWeekends(df$two, df$one)) # taken from the other answer
#  user  system elapsed 
# 76.46    0.06   77.25 

system.time(res_vectorized <- frw(df$two, df$one))
# user  system elapsed 
#    0       0       0

identical(res_mapply, res_vectorized)
# [1] TRUE

David Arenburg answered Oct 11 '22 09:10