Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up quantile calculation

I am using the Hmisc Package to calculate the quantiles of two continous variables and compare the results in a crosstable. You find my code below.

My problem is that the calculation of the quantiles takes a considerable amount of time if the number of observations increases.

Is there any possibility to speed up this procedure by using the data.table, ddply or any other package?

Thanks.

library(Hmisc)

# Set seed
set.seed(123)

# Generate some data
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
c <- data.frame(a,b)

# Calculate quantiles
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)

# Output some descriptives
summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)

# Time spent for calculation:
#       User      System verstrichen 
#      25.13        3.47       28.73 
like image 420
majom Avatar asked Oct 31 '25 22:10

majom


1 Answers

As stated by jlhoward and Ricardo Saporta data.table doesn't seem to speed up things too much in this case. The cut2 function is clearly the bottleneck here. I used another function to calculate the quantiles (see Is there a better way to create quantile "dummies" / factors in R?) and was able to decrease the calculation time by half:

qcut <- function(x, n) {
  if(n<=2)
    { 
    stop("The sample must be split in at least 3 parts.")
  }
  else{
    break.values <- quantile(x, seq(0, 1, length = n + 1), type = 7)
    break.labels <- c(
      paste0(">=",break.values[1], " & <=", break.values[2]),
      sapply(break.values[3:(n)], function(x){paste0(">",break.values[which(break.values == x)-1], " & <=", x)}),
      paste0(">",break.values[(n)], " & <=", break.values[(n+1)]))
    cut(x, break.values, labels = break.labels,include.lowest = TRUE)
  }
}

c$a.quantile.2 <- qcut(c$a, 5)
c$b.quantile.2 <- qcut(c$b, 5)
summaryM(a.quantile.2 ~ b.quantile.2, data=c, overall=TRUE)

# Time spent for calculation:
#       User      System verstrichen 
#      10.22        1.47       11.70 

Using data.table would reduce the calculation time by another second, but I like the summary by the Hmisc package better.

like image 61
majom Avatar answered Nov 03 '25 12:11

majom