I have been using a small tab function for some time, which shows the frequency, percent, and cumulative percent for a vector. The output looks like this:
            Freq    Percent        cum
ARSON        462 0.01988893 0.01988893
BURGLARY   22767 0.98011107 1.00000000
           23229 1.00000000         NA
The excellent dplyr package motivated me to update the function. Now I am wondering how I can make the updated version even faster. Here is the old function:
tab = function(x, useNA = FALSE) {
  # rows: one per unique non-NA value, plus a totals row (plus an NA row if requested)
  k = length(unique(x[!is.na(x)])) + 1
  if (useNA) k = k + 1
  tab = array(NA, c(k, 3))
  colnames(tab) = c("freq.", "prob.", "cum.")
  useNA = ifelse(useNA, "always", "no")
  rownames(tab) = names(c(table(x, useNA = useNA), ""))
  tab[-nrow(tab), 1] = table(x, useNA = useNA)
  tab[-nrow(tab), 2] = prop.table(table(x, useNA = useNA))
  tab[, 3] = cumsum(tab[, 2])
  # last row holds the column totals (single-category case handled separately)
  if (k > 2) tab[nrow(tab), -3] = colSums(tab[-nrow(tab), -3])
  if (k == 2) tab[nrow(tab), -3] = tab[-nrow(tab), -3]
  tab
}
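For illustration (this toy call is my own, not from the original post), the two branches behave like this: with useNA = TRUE an extra <NA> row is counted before the unnamed totals row, otherwise missing values are simply dropped.
y <- c("a", "a", "b", NA)   # toy vector, not one of the benchmark vectors below
tab(y)                      # NA dropped: rows "a", "b", then the totals row
tab(y, useNA = TRUE)        # adds an <NA> row with its count before the totals row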
and here is the new one based on dplyr:
tab2 = function(x, useNA = FALSE) {
  if (!useNA) if (any(is.na(x))) x = na.omit(x)
  n = length(x)
  # %.% is dplyr's chaining operator in the version used here (later replaced by %>%)
  out = data.frame(x, 1) %.%
    group_by(x) %.%
    dplyr::summarise(
      Freq    = length(X1),   # X1 is the dummy column of ones created above
      Percent = Freq / n
    ) %.%
    dplyr::arrange(x)
  ids = as.character(out$x)
  ids[is.na(ids)] = '<NA>'
  out = select(out, Freq, Percent)
  out$cum = cumsum(out$Percent)
  class(out) = "data.frame"
  out = rbind(out, c(n, 1, NA))   # append the totals row
  rownames(out) = c(ids, '')
  out
}
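Loading dplyr and calling the function on the x1 vector from the benchmarks below should reproduce the table shown at the top of the post (the %.% operator belongs to the dplyr release this was written against; later versions replaced it with %>%):
library(dplyr)   # provides group_by(), summarise(), arrange(), select() and, at the time, %.%
x1 = c(rep('ARSON', 462), rep('BURGLARY', 22767))   # same vector as in the benchmarks below
tab2(x1)
#            Freq    Percent        cum
# ARSON       462 0.01988893 0.01988893
# BURGLARY  22767 0.98011107 1.00000000
#           23229 1.00000000         NA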
Finally, some performance benchmarks:
x1 = c(rep('ARSON',462),rep('BURGLARY',22767))
x2 = c(rep('ARSON',462),rep('BURGLARY',22767),rep(NA,100))
x3 = c(c(1:10),c(1:10),1,4)
x4 = c(rep(c(1:100),500),rep(c(1:50),20),1,4)
library('rbenchmark')
benchmark(tab(x1), tab2(x1), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x1)   1.412    2.307
# 2 tab2(x1)   0.612    1.000
benchmark(tab(x2),tab2(x2), replications=100)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x2)   1.351    1.475
# 2 tab2(x2)   0.916    1.000
benchmark(tab(x2,useNA=TRUE), tab2(x2,useNA=TRUE), replications=100)[,c('test','elapsed','relative')]
#                    test elapsed relative
# 1  tab(x2, useNA = TRUE)   1.883    2.282
# 2 tab2(x2, useNA = TRUE)   0.825    1.000
benchmark(tab(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
#       test elapsed relative
# 1  tab(x3)   0.997    1.000
# 2 tab2(x3)   2.194    2.201
benchmark(tab(x4), tab2(x4), table(x4), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x4)  19.481   18.714
# 2  tab2(x4)   1.041    1.000
# 3 table(x4)   6.515    6.258
tab2 is faster except for the very short vector. The performance gain becomes evident with the larger vector (see x4, with 51002 observations). It is also faster than table, even though the function is doing much more.
Now to my question: how can I further improve performance? Creating tables with frequencies and percentages is a pretty standard application, and a fast implementation is very nice when you work with large datasets.
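One low-hanging optimisation of the base-R version, independent of dplyr or data.table, is to call table() only once instead of three times. The following is just a sketch of that idea (my own variant with a hypothetical name, not the author's code), producing the same layout as tab():
tab_base <- function(x, useNA = FALSE) {
  useNA <- ifelse(useNA, "always", "no")
  tbl   <- table(x, useNA = useNA)     # counted once, reused below
  freq  <- as.vector(tbl)
  prob  <- freq / sum(freq)
  out   <- cbind(freq. = c(freq, sum(freq)),
                 prob. = c(prob, sum(prob)),
                 cum.  = c(cumsum(prob), NA))
  rownames(out) <- c(names(tbl), "")   # empty label for the totals row
  out
}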
EDIT: Here is an additional test case with a 2e6 vector (including the data.table solution proposed below):
x5 = sample(c(1:100),2e6, replace=TRUE)
benchmark(tab(x5), tab2(x5), table(x5), tabdt(x5), replications=100)[,c('test','elapsed','relative')]
#        test elapsed relative
# 1   tab(x5) 350.878   19.444
# 2  tab2(x5)  52.917    2.932
# 4 tabdt(x5)  18.046    1.000
# 3 table(x5)  98.429    5.454
As I'm a big fan of library(data.table), I wrote a similar function:
tabdt <- function(x){
  n <- length(which(!is.na(x)))               # count of non-missing values
  dt <- data.table(x)
  out <- dt[, list(Freq = .N, Percent = .N / n), by = x]
  out[!is.na(x), CumSum := cumsum(Percent)]   # cumulative share over non-NA rows only
  out
}
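Loading data.table and printing the result for the x1 vector from the question gives the same counts as the table at the top (the exact print layout depends on the data.table version; print() is used explicitly because a data.table modified with := does not always auto-print):
library(data.table)   # provides data.table(), .N and :=
x1 <- c(rep('ARSON', 462), rep('BURGLARY', 22767))
print(tabdt(x1))
#           x  Freq    Percent     CumSum
# 1:    ARSON   462 0.01988893 0.01988893
# 2: BURGLARY 22767 0.98011107 1.00000000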
> benchmark(tabdt(x1), tab2(x1), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x1)    5.60    1.879
1 tabdt(x1)    2.98    1.000
> benchmark(tabdt(x2), tab2(x2), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x2)    6.34    1.686
1 tabdt(x2)    3.76    1.000
> benchmark(tabdt(x3), tab2(x3), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x3)    1.65    1.000
1 tabdt(x3)    2.34    1.418
> benchmark(tabdt(x4), tab2(x4), replications=1000)[,c('test','elapsed','relative')]
       test elapsed relative
2  tab2(x4)   14.35    1.000
1 tabdt(x4)   22.04    1.536
And so the data.table approach was faster for x1 and x2, while dplyr was faster for x3 and x4. Actually, I don't see any room for improvement using these approaches.
P.S. Would you add the data.table keyword to this question? I believe people would love to see a dplyr vs. data.table performance comparison (see "data.table vs dplyr: can one do something well the other can't or does poorly?" for example).