
How to improve this Algorithm?

Tags:

r

R Version 2.11.1 32-bit on Windows 7

My data file, train.txt, looks like this:

USER_A USER_B ACTION
1        7      0
1        8      1
2        6      2
2        7      1
3        8      2

I process the data with the following algorithm:

train_data=read.table("train.txt",header=T)
result=matrix(0,length(unique(train_data$USER_B)),2)
result[,1]=unique(train_data$USER_B)
for(i in 1:dim(result)[1])
{
    temp=train_data[train_data$USER_B%in%result[i,1],]
    result[i,2]=sum(temp[,3])/dim(temp)[1]
}

The result is the score of every USER_B in train_data. The score is defined as:

score of USER_B = (sum of all ACTION values for USER_B) / (number of times USER_B was recommended)

However, train_data is very large, and this program may take three days to finish. Can this algorithm be improved?

PepsiCo asked Apr 13 '11 06:04


1 Answer

Running your example shows that what you want is the mean ACTION for each unique USER_B:

     [,1] [,2]
[1,]    7  0.5
[2,]    8  1.5
[3,]    6  2.0
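
To see why the score is just a mean, here is a small sketch that rebuilds the five-row sample from the question and applies the question's formula to a single user. The name actions_7 is only illustrative and does not appear in the original code.

```r
# Rebuild the five-row sample from the question.
train_data <- data.frame(
    USER_A = c(1, 1, 2, 2, 3),
    USER_B = c(7, 8, 6, 7, 8),
    ACTION = c(0, 1, 2, 1, 2))

# All ACTION values recorded for USER_B == 7 (illustrative name).
actions_7 <- train_data$ACTION[train_data$USER_B == 7]

# The question's formula: sum of actions divided by number of rows.
score_formula <- sum(actions_7) / length(actions_7)

# The same quantity computed as an arithmetic mean.
score_mean <- mean(actions_7)

score_formula == score_mean
```

Since the sum divided by the row count is the mean, any grouped-mean function can replace the loop.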

You can do this in one line of code with the ddply() function from the plyr package:

library(plyr)
ddply(train_data[, -1], .(USER_B), numcolwise(mean))

  USER_B ACTION
1      6    2.0
2      7    0.5
3      8    1.5

Alternatively, the function tapply in base R does the same:

tapply(train_data$ACTION, train_data$USER_B, mean)
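
Note that tapply returns a one-dimensional array named by the USER_B values, not a two-column table. If you need the same shape as your result matrix, a minimal sketch of the conversion could look like this (the names scores, result and SCORE are my own choices, and the sample data is rebuilt from the question):

```r
# Rebuild the five-row sample from the question.
train_data <- data.frame(
    USER_A = c(1, 1, 2, 2, 3),
    USER_B = c(7, 8, 6, 7, 8),
    ACTION = c(0, 1, 2, 1, 2))

# One mean per USER_B; the array's names are the USER_B values as strings.
scores <- tapply(train_data$ACTION, train_data$USER_B, mean)

# Turn the named array into a two-column data frame.
# names(scores) is character, so convert it back to numeric.
result <- data.frame(
    USER_B = as.numeric(names(scores)),
    SCORE  = as.vector(scores))

result
```

The rows come out sorted by USER_B, whereas the loop keeps the order of first appearance.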

Depending on the size of your table, you can get an improvement in execution time of 20x or higher. Here is a system.time test on a data.frame with a million rows. In user time, your algorithm takes 116 seconds, ddply() takes 5.4 seconds, and tapply takes 1.2 seconds:

train_data <- data.frame(
        USER_A = 1:1e6,
        USER_B = sample(1:1e3, size=1e6, replace=TRUE),
        ACTION = sample(1:100, size=1e6, replace=TRUE))

yourfunction <- function(){
    result <- matrix(0,length(unique(train_data$USER_B)),2)
    result[,1] <- unique(train_data$USER_B)
    for(i in 1:dim(result)[1]){
        temp=train_data[train_data$USER_B%in%result[i,1],]
        result[i,2]=sum(temp[,3])/dim(temp)[1]
    }
    result
}

system.time(XX <- yourfunction())
   user  system elapsed 
 116.29   14.04  134.33 

system.time(YY <- ddply(train_data[, -1], .(USER_B), numcolwise(mean)))
   user  system elapsed 
   5.43    1.60    7.19 

system.time(ZZ <- tapply(train_data$ACTION, train_data$USER_B, mean))
   user  system elapsed 
   1.17    0.06    1.25 
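
Before trusting the speedup, it is worth checking that the fast version returns the same scores as the loop. The sketch below does that on a smaller table so it finishes quickly. The seed, the table size, and the names small, loop_scores and agree are assumptions of mine, and the loop is wrapped to take the data as an argument rather than reading a global.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

# A smaller table than the million-row benchmark (assumed size).
small <- data.frame(
    USER_A = 1:1e4,
    USER_B = sample(1:100, size = 1e4, replace = TRUE),
    ACTION = sample(1:100, size = 1e4, replace = TRUE))

# The loop from the question, taking the data frame as an argument.
loop_scores <- function(d) {
    ids <- unique(d$USER_B)
    result <- matrix(0, length(ids), 2)
    result[, 1] <- ids
    for (i in seq_along(ids)) {
        temp <- d[d$USER_B %in% ids[i], ]
        result[i, 2] <- sum(temp[, 3]) / dim(temp)[1]
    }
    result
}

XX <- loop_scores(small)
ZZ <- tapply(small$ACTION, small$USER_B, mean)

# tapply sorts groups by USER_B; the loop uses order of first appearance.
# Index ZZ by name to line the two up before comparing.
agree <- all.equal(XX[, 2], as.vector(ZZ[as.character(XX[, 1])]))
agree
```

If agree is TRUE, the two approaches differ only in row order and speed.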
Andrie answered Oct 05 '22 10:10