What is the fastest way to obtain frequencies of integers in a vector?

Q: How do you find the frequency of a vector?

Count the Frequency of elements in a Numeric Vector – tabulate() Function. tabulate() function in R Language is used to count the frequency of occurrence of a element in the vector. This function checks for each element in the vector and returns the number of times it occurs in the vector.

Q: How do you find the frequency of a number?

To calculate frequency, divide the number of times the event occurs by the length of time. Example: Anna divides the number of website clicks (236) by the length of time (one hour, or 60 minutes). She finds that she receives 3.9 clicks per minute.

Tags:

r

frequency

histogram

ecdf

Is there a simple and fast way to obtain the frequency of each integer that occurs in a vector of integers in R?

Here are my attempts so far:

x <- floor(runif(1000000)*1000)

print('*** using TABLE:')
system.time(as.data.frame(table(x)))

print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))

print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})

print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})

print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h})

#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))

#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied

and here are the results:

[1] "*** using TABLE:"
   user  system elapsed 
   1.26    0.03    1.29 
[1] "*** using HIST:"
   user  system elapsed 
   0.11    0.00    0.10 
[1] "*** using SORT"
   user  system elapsed 
   0.22    0.02    0.23 
[1] "*** using ECDF"
   user  system elapsed 
   0.17    0.00    0.17 
[1] "*** counting in loop"
   user  system elapsed 
   3.12    0.00    3.12

As you can see table is ridiculously slow and hist seems to be the fastest. But hist (as I am using it) is working on arbitrarily-specifiable breakpoints, whereas I simply want to bin integers. Isn't there a way to trade that flexibility for better performance?

In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.

808

asked Aug 23 '13 14:08

Museful

3 Answers

The fastest is to use tabulate but it requires positive integers as input, so you have to do a quick monotonic transformation.

set.seed(21)
x <- as.integer(runif(1e6)*1000)
system.time({
  adj <- 1L - min(x)
  y <- setNames(tabulate(x+adj), sort(unique(x)))
})

answered Nov 14 '22 21:11

Joshua Ulrich

Don't forget you can inline C++ code in R.

library(inline)

src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
int test = max(xa);
Rcpp::NumericVector xab(test);
for (int i = 0; i < n_xa; i++)
xab[xa[i]-1]++;
return xab;
'
fun <- cxxfunction(signature(a = "numeric"),src, plugin = "Rcpp")

answered Nov 14 '22 22:11

Joe

I think tabulate or the C++ versions are the way to go but here's some code using rbenchmark which is a great package for looking at timings (I added a few slower function tests too):

######################
### ---Clean Up--- ###
######################

rm(list = ls())
gc()

######################
### ---Packages--- ###
#####################

require(parallel) 
require(data.table)
require(rbenchmark)
require(inline)


#######################
### ---Functions--- ###
#######################

# Competitor functions by Breyal
Breyal.using_datatable <- function(x) {DT <- data.table(x = x, weight = 1, key = "x"); DT[, length(weight), by = x]}
Breyal.using_lapply_1c_eq <- function(x = sort(x)) { lapply(unique(x), function(u) sum(x == u)) } # 1 core
Breyal.using_mclapply_8c_eq <- function(x = sort(x)) { mclapply(unique(x), function(u) sum(x == u), mc.cores = 8L) } # 8 cores

# Competitor functions by tennenrishin
tennenrishin.using_table <- function(x) as.data.frame(table(x))
tennenrishin.using_hist <- function(x) hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE)
tennenrishin.using_sort <- function(x) {cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf}
tennenrishin.using_ecdf <- function(x) {i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])}
tennenrishin.using_counting_loop <- function(x) {h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]]<-h[x[i]]+1}; h}

# Competitor function by Ulrich
Ulrich.using_tabulate <- function(x) {
  adj <- 1L - min(x)
  y <- setNames(tabulate(x+adj), sort(unique(x)))
  return(y)
}

# I couldn't get the Joe's C++ version to work (my laptop won't install inline) butI suspect that would be the fastest solution

##################
### ---Data--- ###
##################

# Set seed so results are reproducable
set.seed(21)

# Data vector
x <- floor(runif(1000000)*1000)


#####################
### ---Timings--- ###
#####################

# Benchmarks using Ubuntu 13.04 x64 with 8GB RAM and i7-2600K CPU @ 3.40GHz
benchmark(replications = 5,
          tennenrishin.using_table(x),
          tennenrishin.using_hist(x),
          tennenrishin.using_sort(x),
          tennenrishin.using_ecdf(x),
          tennenrishin.using_counting_loop(x),
          Ulrich.using_tabulate(x),
          Breyal.using_datatable(x),
          Breyal.using_lapply_1c_eq(x),
          Breyal.using_mclapply_8c_eq(x),
          order = "relative")

Which results in the following timings

                                 test replications elapsed relative user.self sys.self user.child sys.child
6            Ulrich.using_tabulate(x)            5   0.176    1.000     0.176    0.000       0.00     0.000
2          tennenrishin.using_hist(x)            5   0.468    2.659     0.468    0.000       0.00     0.000
3          tennenrishin.using_sort(x)            5   0.687    3.903     0.688    0.000       0.00     0.000
4          tennenrishin.using_ecdf(x)            5   0.749    4.256     0.748    0.000       0.00     0.000
7           Breyal.using_datatable(x)            5   2.960   16.818     2.960    0.000       0.00     0.000
1         tennenrishin.using_table(x)            5   4.651   26.426     4.596    0.052       0.00     0.000
9      Breyal.using_mclapply_8c_eq(x)            5  10.817   61.460     0.140    1.196      54.62     7.112
5 tennenrishin.using_counting_loop(x)            5  10.922   62.057    10.912    0.000       0.00     0.000
8        Breyal.using_lapply_1c_eq(x)            5  36.807  209.131    36.768    0.000       0.00     0.000

answered Nov 14 '22 22:11

Tony Breyal

Related questions
                            
                                Installing Rmpi on LAM/MPI cluster
                            
                                How do I extract hashtags from tweets in R?
                            
                                How to get sum of values every 8 days by date in data frame in R
                            
                                replace list elements (avoid global assignment)
                            
                                Hashing function for mapping integers to a given range?
                            
                                How do I set width of candles in candle chart using plot.xts?
                            
                                "last name, first name" -> "first name last name" in serialized strings
                            
                                Conditionally remove rows from dataframe (more than one conditions)
                            
                                Blocking and waiting in R
                            
                                Add Regression Line ggplot for Only Certain Groups
                            
                                How to adjust the tile height in geom tile?
                            
                                Remove duplicated columns in matrix
                            
                                Attractive 3D plot in R
                            
                                Rewriting slow R function in C++ & Rcpp
                            
                                ggplot2 mapping county boundries in one color and state boundries in another on the same map
                            
                                create templates using ggplot2 syntax?
                            
                                Use of offset in lm regression - R
                            
                                Connect R and Vertica using RODBC
                            
                                Remove consecutive duplicate entries
                            
                                Make a boxplot without whiskers

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With