 

Help me replace a for loop with an "apply" function

...if that is possible

My task is to find the longest streak of continuous days a user participated in a game.

Instead of writing an SQL function, I chose to use R's rle function to get the longest streaks and then update my db table with the results.

The (attached) dataframe is something like this:

    day      user_id
2008/11/01    2001
2008/11/01    2002
2008/11/01    2003
2008/11/01    2004
2008/11/01    2005
2008/11/02    2001
2008/11/02    2005
2008/11/03    2001
2008/11/03    2003
2008/11/03    2004
2008/11/03    2005
2008/11/04    2001
2008/11/04    2003
2008/11/04    2004
2008/11/04    2005

I tried the following to get the longest streak per user:

# turn it to a contingency table
my_table <- table(user_id, day)

# get the streaks
rle_table <- apply(my_table,1,rle)

# verify the longest streak of "1"s for user 2001
# as.vector(tapply(rle_table$'2001'$lengths, rle_table$'2001'$values, max)["1"])

# loop to get the results
# initiate results matrix
res<-matrix(nrow=dim(my_table)[1], ncol=2)

for (i in 1:dim(my_table)[1]) {
  string <- paste("as.vector(tapply(rle_table$'", rownames(my_table)[i], "'$lengths, rle_table$'", rownames(my_table)[i], "'$values, max)['1'])", sep="")
  res[i,] <- c(as.integer(rownames(my_table)[i]), eval(parse(text=string)))
}

Unfortunately, this for loop takes too long, and I'm wondering if there is a way to produce the res matrix using a function from the "apply" family.

Thank you in advance

George Dontas asked Dec 17 '22 05:12



1 Answer

The apply functions are not always (or even generally) faster than a for loop. That belief is a remnant of R's association with S-Plus (in the latter, apply is faster than for). One exception is lapply, which is frequently faster than for (because it uses C code). See this related question.

So you should use apply primarily to improve the clarity of code, not to improve performance.
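As a sketch of what a clearer version of the loop in the question might look like: each row of the contingency table can be run-length encoded and reduced to a single number inside one function, so the eval(parse(...)) string building is not needed. The helper name longest_streak and the sample data frame dat (rebuilt from the rows shown in the question) are assumptions for illustration. The sketch also assumes every calendar day appears at least once in the data, so that adjacent table columns are consecutive days.

```r
# Sample data rebuilt from the rows shown in the question (illustrative only)
dat <- data.frame(
  day = rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
            times = c(5, 2, 4, 4)),
  user_id = c(2001:2005, 2001, 2005, 2001, 2003:2005, 2001, 2003:2005)
)

# Contingency table: one row per user, one column per day
my_table <- table(dat$user_id, dat$day)

# Hypothetical helper: longest run of 1s in a 0/1 vector (0 if there is none)
longest_streak <- function(x) {
  r <- rle(as.vector(x))
  max(r$lengths[r$values == 1], 0L)
}

# One call per row; the result is a vector named by user id
streaks <- apply(my_table, 1, longest_streak)

# Same shape as the res matrix in the question: user id, longest streak
res <- cbind(user_id = as.integer(names(streaks)), streak = streaks)
res
```

Whether this is faster than the original loop would have to be measured; the main gain is that the intent is readable at a glance.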

You might find Dirk's presentation on high-performance computing useful. Another brute-force approach is "just-in-time compilation" with Ra, a variant of R that is optimized to handle for loops, used in place of the normal R version.

[Edit:] There are clearly many ways to achieve this, and this is by no means better even if it's more compact. Just working with your code, here's another approach:

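# count participation per (day, user_id); keep the user_id and Freq columns
dt <- data.frame(table(dat))[,2:3]
# run-length encode each user's 0/1 participation sequence (in day order)
dt.b <- by(dt[,2], dt[,1], rle)
# longest run of 1s (days played) per user, ignoring runs of 0s
t(data.frame(lapply(dt.b, function(x) max(x$lengths[x$values == 1]))))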

You would probably need to manipulate the output a little further.
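One possible way to do that further manipulation, as a sketch rather than the definitive version: it swaps by for split, so the runs come back as a plain named list, and uses vapply to collect the result into a two-column data frame that could be written back to the database. The sample dat is rebuilt from the rows in the question, and the column name longest_streak is an assumption.

```r
# Sample data rebuilt from the rows shown in the question (illustrative only)
dat <- data.frame(
  day = rep(c("2008/11/01", "2008/11/02", "2008/11/03", "2008/11/04"),
            times = c(5, 2, 4, 4)),
  user_id = c(2001:2005, 2001, 2005, 2001, 2003:2005, 2001, 2003:2005)
)

# Long-format counts: one row per (day, user_id) pair, days varying fastest
dt <- data.frame(table(dat))[, 2:3]

# Run-length encode each user's participation sequence (split instead of by)
runs <- lapply(split(dt$Freq, dt$user_id), rle)

# Longest run of 1s per user, counting only runs of 1s; 0 if a user has none
streaks <- vapply(runs,
                  function(x) max(x$lengths[x$values == 1], 0L),
                  integer(1))

# Tidy two-column result
res <- data.frame(user_id = as.integer(names(streaks)),
                  longest_streak = unname(streaks))
res
```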

Shane answered Dec 28 '22 22:12
