Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Progressively find most frequent item in list in R

I would like to go through a list, and check to see if that item is the most frequent item in the list up until that point. The solution I currently have is incredibly slow compared to Python. Is there an effective way to speed it up?

   dat<-data.table(sample(1:50,10000,replace=T))
   k<-1
   correct <- 0  # total correct predictions
   for (i in 2:(nrow(dat)-1)) {
      if (dat[i,V1] %in% dat[1:(i-1),.N,by=V1][order(-N),head(.SD,k)][,V1]) {
         correct <- correct + 1
      }
   }

More generally, I would eventually like to see if an item is one of the k most frequent items up until a point, or if it has one of the k highest values up until a point.

For comparison, here is a very fast implementation in Python:

dat=[random.randint(1,50) for i in range(10000)]
correct=0
k=1
list={}

for i in dat:
    toplist=heapq.nlargest(k,list.iteritems(),key=operator.itemgetter(1))
    toplist=[j[0] for j in toplist]
    if i in toplist:
        correct+=1
    if list.has_key(i):
        list[i]=list[i]+1
    else:
        list[i]=1
like image 575
Jeff Avatar asked Feb 21 '15 02:02

Jeff


1 Answers

Here's what I've got so far (my solution is f3):

set.seed(10)
dat<-data.table(sample(1:3,100,replace=T))
k<-1

f3 <- function(dat) {
    correct <- 0  # total correct predictions
    vf <- factor(dat$V1)
    v <- as.integer(vf) 
    tabs <- integer(max(v))
    for (i in 2:(nrow(dat)-1)) {
        tabs[v[i-1]] <- tabs[v[i-1]] + 1
        #print(tabs)
        #print(v[1:i])
        if (match(v[i],order(tabs,decreasing = T))<=k) {
            correct <- correct + 1
        }
        #print(correct)
        #print('')

    }
    correct
}

f1 <- function(dat) {
    correct <- 0  # total correct predictions
    for (i in 2:(nrow(dat)-1)) {

        if (dat[i,V1] %in% dat[1:(i-1),.N,by=V1][order(-N),head(.SD,k)]) {
            correct <- correct + 1
        }

    }
    correct
}

library(rbenchmark)

print(f1(dat)==f3(dat))

library(rbenchmark)
benchmark(f1(dat),f3(dat),replications=10)

The benchmark results:

     test replications elapsed relative user.self sys.self user.child sys.child
1 f1(dat)           10   2.939  163.278     2.931    0.008          0         0
2 f3(dat)           10   0.018    1.000     0.018    0.000          0         0

are encouraging, but f3 has two problems:

  1. It doesn't always provide the same answer as OP's algorithm because the ties are treated differently,

  2. There is a lot of room for improvement, because tabs are sorted every time anew.

like image 181
Marat Talipov Avatar answered Nov 11 '22 21:11

Marat Talipov