Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance scaling on very large character matrix loop

I have a very large character matrix in R, approximately [500000, 5], containing names. Each row may contain duplicate names. I'd like to know how many distinct names there are on each row. As far as I know, I can't vectorize any of the functions in this loop, right?

For example:

sampleNames <- c("Bob", "Elliot", "Sarah")


# Dimensions [100000, 5]
mat <- matrix(sampleNames[round(runif(500000, 1, 3))], ncol = 5)

NamesPerRow <- vector()

startTime <- Sys.time()
for(i in 1:dim(mat)[1]){
  NamesPerRow[i] <- length(unique(mat[i,])) 
}
Sys.time() - startTime

This takes only 20 seconds on my machine. Very tolerable. However, if the matrix has 5 times as many rows, the loop takes much longer than 100 seconds:

sampleNames <- c("Bob", "Elliot", "Sarah")

# Dimensions [500000, 5]
mat <- matrix(sampleNames[round(runif(2500000, 1, 3))], ncol = 5)


NamesPerRow <- vector()

startTime <- Sys.time()
for(i in 1:dim(mat)[1]){
  NamesPerRow[i] <- length(unique(mat[i,])) 
}
Sys.time() - startTime

This takes 13.12 minutes on my machine. 40 times longer than the 100000x5 matrix. Outrageous!

Any tricks I can use to perform these operations much quicker? Can I indeed vectorize anything here? Is this something I can fix with multi-threading (I'm not familiar)?

Also, what's going on here? Is it typical for computing time to increase at a much faster rate than the data I'm operating on?

Thank you.

like image 983
ch-pub Avatar asked Feb 09 '23 09:02

ch-pub


1 Answers

You can also use rowTabulates from matrixStats package

# Dimensions [500000, 5]
mat <- matrix(sampleNames[round(runif(2500000, 1, 3))], ncol = 5)
library(matrixStats)
startTime <- Sys.time()
mat1 <- matrix(match(mat, sampleNames), ncol=5)
b <- rowSums(rowTabulates(mat1)!=0)
Sys.time() - startTime
# Time difference of 0.2012889 secs

apply() by @Richard Scriven

startTime <- Sys.time()
a <- apply(mat, 1, function(x) length(unique(x)))
Sys.time() - startTime
# Time difference of 4.231503 secs
all.equal(a, b)
# [1] TRUE
like image 153
ExperimenteR Avatar answered Feb 11 '23 23:02

ExperimenteR