Faster proportion tables in R

Tags: performance, r, xts

I am creating proportion tables based on an xts object. As this is part of a large program that (unfortunately) requires around 10^6 loops, it creates quite a bottleneck and I would like to speed it up.

This is an example of what I started with:

library(quantmod)

test.xts <- xts(sample(1:10, 50, replace = TRUE), as.Date(1:50))

system.time(for(i in 1:10000){
  prop.table(table(test.xts))
})

   user  system elapsed 
  19.86    0.00   18.58 

I have already changed the xts to a matrix and that resulted in a significant speed increase. I only mention that it is originally an xts in case I'm missing something with xts that would speed this up beyond the gains I've already seen converting it to a matrix.
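As an aside (my own suggestion, not something timed in the question): xts stores its observations in a plain matrix internally, and coredata() from the zoo/xts packages is the documented accessor for that raw data, so it is another way to strip the time index before tabulating:

```r
library(xts)

# Build the same kind of object as in the question.
test.xts <- xts(sample(1:10, 50, replace = TRUE), as.Date(1:50))

# coredata() returns the underlying data without the time index.
test.mat <- coredata(test.xts)
prop.table(table(test.mat))
```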

test.mat <- as.matrix(test.xts)

system.time(for(i in 1:10000){
  prop.table(table(test.mat))
})

   user  system elapsed 
   2.78    0.00    2.90 

But I'd really like it to be as fast as possible, so I'm hoping others have suggestions for further improvements, or that there's an obvious approach I'm overlooking.

One additional piece of information is that the output from these tables is ultimately merged with similar output from a different time period, so the dimensions need to remain named. (I.e., I need to be able to match the proportion for a value of '10' at time 1 with the proportion of '10' at time 2).
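To illustrate that requirement (a sketch with made-up numbers, not my actual data): with named vectors, the two periods can be aligned by name before merging:

```r
# Hypothetical proportions from two time periods, named by value.
p1 <- c(`8` = 0.3, `9` = 0.5, `10` = 0.2)
p2 <- c(`9` = 0.4, `10` = 0.6)

# Align on the values present in both periods;
# rows are named by value, columns by period.
common <- intersect(names(p1), names(p2))
cbind(t1 = p1[common], t2 = p2[common])
```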

Any help is greatly appreciated.

Asked May 18 '12 by Bryan S

2 Answers

table() implicitly creates a factor, which is expensive. In your case you can save a lot (more than 10x) by using tabulate() instead, since you already have integers:

a <- tabulate(test.mat)
names(a) <- seq_along(a)
a / sum(a)
   1    2    3    4    5    6    7    8    9   10 
0.16 0.14 0.08 0.14 0.08 0.16 0.02 0.06 0.10 0.06 

Timings:

system.time(for(i in 1:10000){
  a <- tabulate(test.mat)
  names(a) <- seq_along(a)
  a/sum(a)
})

 user  system elapsed 
0.208   0.002   0.210 
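One refinement worth noting for the merging requirement (my addition, not part of the original answer): tabulate() takes an nbins argument, so both time periods can be forced to the same length even when one period never reaches the maximum value:

```r
x1 <- c(1, 2, 2, 10)   # period 1 reaches the maximum
x2 <- c(1, 3, 3)       # period 2 does not

k <- 10                # assumed known maximum value across periods
p1 <- tabulate(x1, nbins = k); names(p1) <- seq_len(k)
p2 <- tabulate(x2, nbins = k); names(p2) <- seq_len(k)

# Both vectors now have named entries 1..10, so they line up for merging.
p1 / sum(p1)
p2 / sum(p2)
```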

Your timing, for comparison:

system.time(for(i in 1:10000) prop.table(table(test.mat)))
 user  system elapsed 
3.373   0.028   3.402 
Answered Nov 10 '22 by Simon Urbanek


To build on joran's comment, using tabulate() directly can prove faster. It does have three quirks worth noting:

  1. It only deals with integers, truncating decimals toward zero.
  2. It silently ignores zeros and all negative values.
  3. It creates a bin for every value in 1:n, even bins with zero counts.

See ?tabulate for details.
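A quick illustration of those quirks on a toy input of my own:

```r
# 1.9 truncates to 1 and 2.2 truncates to 2; -1 and 0 are silently
# dropped; every bin from 1 up to the maximum (3) gets a count.
tabulate(c(1.9, 2.2, -1, 0, 3))
# [1] 1 1 1
```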

With those caveats in mind, here's a function that gives roughly a 9x speed-up:

prop2 <- function(x){
  x <- tabulate(x)
  out <- x/sum(x)
  names(out) <- seq_along(out)
  return(out)
}

Test speed:

library(rbenchmark)
test.mat <- as.matrix(test.xts)
f1 <- function() prop.table(table(test.mat))
benchmark(f1(), prop2(test.mat),
           replications = 1000,
           columns = c("test", "relative", "elapsed"),
           order = "relative")
#------
             test relative elapsed
2 prop2(test.mat)      1.0    0.10
1            f1()      9.1    0.91

Confirm the output is the same:

> prop.table(table(test.mat))
test.mat
   1    2    3    4    5    6    7    8    9   10 
0.04 0.02 0.20 0.12 0.08 0.10 0.06 0.14 0.12 0.12 
> prop2(test.mat)
   1    2    3    4    5    6    7    8    9   10 
0.04 0.02 0.20 0.12 0.08 0.10 0.06 0.14 0.12 0.12
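For a programmatic check rather than eyeballing (a self-contained sketch with deterministic data, since prop.table() returns a table while prop2() returns a plain named vector), compare names and values separately. Note this only lines up when every value from 1 to the maximum actually occurs, since table() drops absent values while tabulate() keeps zero-count bins:

```r
prop2 <- function(x){
  x <- tabulate(x)
  out <- x/sum(x)
  names(out) <- seq_along(out)
  return(out)
}

# Deterministic data covering every value 1..10 five times each.
test.mat <- matrix(rep(1:10, 5))

a <- prop.table(table(test.mat))
b <- prop2(test.mat)
identical(names(a), names(b)) && isTRUE(all.equal(as.vector(a), as.vector(b)))
# [1] TRUE
```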
Answered Nov 10 '22 by Chase