Is there a better way of obtaining the same output as table(vec) where vec is a vector?

Tags:

r

Suppose I have a vector and I don't know, a priori, its unique elements (here: 1 and 2).

vec <-
  c(1, 1, 1, 2, 2, 2, 2)

I would like to know whether there is a better (or more elegant) way of getting the count of each unique element in vec, i.e. the same result as table(vec). It doesn't matter whether the result is a data.frame or a named vector.

R> table(vec)
vec
1 2 
3 4 

Reason: I was curious to know if there is a better way. Also, I noticed that there is a for loop in the base implementation (in addition to a .C call). I don't know if it's a big concern, but when I do something like

R> table(rep(1:1000,100000))

R takes a really long time. I am sure it's because the vector is huge (1000 values repeated 100000 times, i.e. 10^8 elements). But is there a way of making it faster?

EDIT: In addition to Chase's answer, this also does a good job (sampData is the vector defined in that answer):

R> rle(sort(sampData))
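
The run lengths returned by rle() can be turned into the same named counts that table(vec) prints. A minimal sketch on the small vec from the top of the question; the names runs and counts are only for illustration:

```r
vec <- c(1, 1, 1, 2, 2, 2, 2)

# Sorting puts equal values next to each other, so each run is one unique value.
runs <- rle(sort(vec))

# Name each run length by the value it belongs to.
counts <- setNames(runs$lengths, runs$values)
counts
```

The result is a plain named integer vector rather than a table object, which is fine here since either form is acceptable.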
suncoolsu asked Jan 16 '23 13:01


1 Answer

This is an interesting problem, and I'm curious to see other thoughts on it. The source of table() shows that it builds on tabulate(). tabulate() has a few quirks: it only counts positive integers, and it returns an integer vector without names. We can use sort(unique()) on our vector to supply the names. If you need to tabulate zero or negative values, you would have to go back to table(), since tabulate() does not handle them according to the examples on its help page.
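
These quirks of tabulate() are easy to see on small inputs. A quick sketch; the vectors and variable names are made up for illustration:

```r
# A gap in the values still gets a bin, with a count of zero.
gappy <- tabulate(c(1, 3, 3))
gappy          # 1 0 2

# Values below 1 are silently dropped rather than counted.
dropped <- tabulate(c(-1, 0, 2))
dropped        # 0 1

# table() keeps every distinct value, including zero and negatives.
kept <- table(c(-1, 0, 2))
names(kept)    # "-1" "0" "2"
```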

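table2 <- function(data) {
    # Counts for the integers 1..max(data); the result has no names.
    x <- tabulate(data)
    # Name each count by its value. The names line up only when data
    # contains every integer from 1 to max(data), as in the test below.
    y <- sort(unique(data))
    names(x) <- y
    return(x)
}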

And a quick test:

> set.seed(42)
> sampData <- sample(1:5, 10000000, TRUE, prob = c(.3,.25, .2, .15, .1))
> 
> system.time(table(sampData))
   user  system elapsed 
  4.869   0.669   5.503 
> system.time(table2(sampData))
   user  system elapsed 
  0.410   0.200   0.605 
> 
> table(sampData)
sampData
      1       2       3       4       5 
2999200 2500232 1998652 1500396 1001520 
> table2(sampData)
      1       2       3       4       5 
2999200 2500232 1998652 1500396 1001520 
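
Note that table2() above relies on the data being the integers 1 to max(data) with none missing; otherwise tabulate() returns more bins than there are unique values and the names no longer line up. One possible way around this, sketched here under the hypothetical name table3() and not timed, is to map the values onto 1..k with match() before tabulating:

```r
# Hypothetical variant of table2(): counts every distinct value, including
# zero, negatives, and values with gaps between them.
table3 <- function(data) {
    y <- sort(unique(data))
    # match() maps each element to its position in y, giving indices
    # 1..length(y), which is the form tabulate() expects.
    x <- tabulate(match(data, y), nbins = length(y))
    names(x) <- y
    x
}

res <- table3(c(-1, 0, 0, 5, 5, 5))
res
```

The extra match() call does additional work on every element, so this variant is likely to be slower than table2() on data like sampData; whether it still beats table() would need to be measured.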

EDIT: I just realized there is a count() function in plyr, which is another alternative to table(). In the test above, it is faster than table() (about 2.5 s elapsed versus 5.5 s) but slower than the hack-job solution I put together (about 0.6 s):

> library(plyr)
> system.time(count(sampData))
   user  system elapsed 
  1.620   0.870   2.483
Chase answered Jan 25 '23 15:01