Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to obtain frequencies of integers in a vector?

Is there a simple and fast way to obtain the frequency of each integer that occurs in a vector of integers in R?

Here are my attempts so far:

x <- floor(runif(1000000)*1000)

print('*** using TABLE:')
system.time(as.data.frame(table(x)))

print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))

print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})

print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})

print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h})

#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))

#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied

and here are the results:

[1] "*** using TABLE:"
   user  system elapsed 
   1.26    0.03    1.29 
[1] "*** using HIST:"
   user  system elapsed 
   0.11    0.00    0.10 
[1] "*** using SORT"
   user  system elapsed 
   0.22    0.02    0.23 
[1] "*** using ECDF"
   user  system elapsed 
   0.17    0.00    0.17 
[1] "*** counting in loop"
   user  system elapsed 
   3.12    0.00    3.12 

As you can see table is ridiculously slow and hist seems to be the fastest. But hist (as I am using it) is working on arbitrarily-specifiable breakpoints, whereas I simply want to bin integers. Isn't there a way to trade that flexibility for better performance?

In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.

like image 808
Museful Avatar asked Aug 23 '13 14:08

Museful


People also ask

How do you find the frequency of a vector?

Count the Frequency of elements in a Numeric Vector – tabulate() Function. tabulate() function in R Language is used to count the frequency of occurrence of a element in the vector. This function checks for each element in the vector and returns the number of times it occurs in the vector.

How do you find the frequency of a number?

To calculate frequency, divide the number of times the event occurs by the length of time. Example: Anna divides the number of website clicks (236) by the length of time (one hour, or 60 minutes). She finds that she receives 3.9 clicks per minute.


3 Answers

The fastest is to use tabulate but it requires positive integers as input, so you have to do a quick monotonic transformation.

set.seed(21)
x <- as.integer(runif(1e6)*1000)
system.time({
  adj <- 1L - min(x)
  y <- setNames(tabulate(x+adj), sort(unique(x)))
})
like image 98
Joshua Ulrich Avatar answered Nov 14 '22 21:11

Joshua Ulrich


Don't forget you can inline C++ code in R.

library(inline)

src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
int test = max(xa);
Rcpp::NumericVector xab(test);
for (int i = 0; i < n_xa; i++)
xab[xa[i]-1]++;
return xab;
'
fun <- cxxfunction(signature(a = "numeric"),src, plugin = "Rcpp")
like image 32
Joe Avatar answered Nov 14 '22 22:11

Joe


I think tabulate or the C++ versions are the way to go but here's some code using rbenchmark which is a great package for looking at timings (I added a few slower function tests too):

######################
### ---Clean Up--- ###
######################

rm(list = ls())
gc()

######################
### ---Packages--- ###
#####################

require(parallel) 
require(data.table)
require(rbenchmark)
require(inline)


#######################
### ---Functions--- ###
#######################

# Competitor functions by Breyal
Breyal.using_datatable <- function(x) {DT <- data.table(x = x, weight = 1, key = "x"); DT[, length(weight), by = x]}
Breyal.using_lapply_1c_eq <- function(x = sort(x)) { lapply(unique(x), function(u) sum(x == u)) } # 1 core
Breyal.using_mclapply_8c_eq <- function(x = sort(x)) { mclapply(unique(x), function(u) sum(x == u), mc.cores = 8L) } # 8 cores

# Competitor functions by tennenrishin
tennenrishin.using_table <- function(x) as.data.frame(table(x))
tennenrishin.using_hist <- function(x) hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE)
tennenrishin.using_sort <- function(x) {cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf}
tennenrishin.using_ecdf <- function(x) {i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])}
tennenrishin.using_counting_loop <- function(x) {h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h}

# Competitor function by Ulrich
Ulrich.using_tabulate <- function(x) {
  adj <- 1L - min(x)
  y <- setNames(tabulate(x+adj), sort(unique(x)))
  return(y)
}

# I couldn't get the Joe's C++ version to work (my laptop won't install inline) butI suspect that would be the fastest solution

##################
### ---Data--- ###
##################

# Set seed so results are reproducable
set.seed(21)

# Data vector
x <- floor(runif(1000000)*1000)


#####################
### ---Timings--- ###
#####################

# Benchmarks using Ubuntu 13.04 x64 with 8GB RAM and i7-2600K CPU @ 3.40GHz
benchmark(replications = 5,
          tennenrishin.using_table(x),
          tennenrishin.using_hist(x),
          tennenrishin.using_sort(x),
          tennenrishin.using_ecdf(x),
          tennenrishin.using_counting_loop(x),
          Ulrich.using_tabulate(x),
          Breyal.using_datatable(x),
          Breyal.using_lapply_1c_eq(x),
          Breyal.using_mclapply_8c_eq(x),
          order = "relative")

Which results in the following timings

                                 test replications elapsed relative user.self sys.self user.child sys.child
6            Ulrich.using_tabulate(x)            5   0.176    1.000     0.176    0.000       0.00     0.000
2          tennenrishin.using_hist(x)            5   0.468    2.659     0.468    0.000       0.00     0.000
3          tennenrishin.using_sort(x)            5   0.687    3.903     0.688    0.000       0.00     0.000
4          tennenrishin.using_ecdf(x)            5   0.749    4.256     0.748    0.000       0.00     0.000
7           Breyal.using_datatable(x)            5   2.960   16.818     2.960    0.000       0.00     0.000
1         tennenrishin.using_table(x)            5   4.651   26.426     4.596    0.052       0.00     0.000
9      Breyal.using_mclapply_8c_eq(x)            5  10.817   61.460     0.140    1.196      54.62     7.112
5 tennenrishin.using_counting_loop(x)            5  10.922   62.057    10.912    0.000       0.00     0.000
8        Breyal.using_lapply_1c_eq(x)            5  36.807  209.131    36.768    0.000       0.00     0.000
like image 26
Tony Breyal Avatar answered Nov 14 '22 22:11

Tony Breyal