I have a very large character matrix in R, roughly 500000 x 5, containing names. Each row may contain duplicate names, and I'd like to know how many distinct names there are in each row. As far as I know, none of the functions in the loop below can be vectorized, right?
For example:
sampleNames <- c("Bob", "Elliot", "Sarah")
# Dimensions [100000, 5]
mat <- matrix(sampleNames[round(runif(500000, 1, 3))], ncol = 5)
NamesPerRow <- vector()
startTime <- Sys.time()
for (i in 1:dim(mat)[1]) {
  NamesPerRow[i] <- length(unique(mat[i, ]))
}
Sys.time() - startTime
This takes only 20 seconds on my machine. Very tolerable. However, with 5 times as many rows, the loop takes far longer than the 100 seconds that linear scaling would predict:
sampleNames <- c("Bob", "Elliot", "Sarah")
# Dimensions [500000, 5]
mat <- matrix(sampleNames[round(runif(2500000, 1, 3))], ncol = 5)
NamesPerRow <- vector()
startTime <- Sys.time()
for (i in 1:dim(mat)[1]) {
  NamesPerRow[i] <- length(unique(mat[i, ]))
}
Sys.time() - startTime
This takes 13.12 minutes on my machine, about 40 times longer than for the 100000 x 5 matrix. Outrageous!
Are there any tricks I can use to perform this operation much faster? Can anything here be vectorized after all? Is this something I could fix with multi-threading (which I'm not familiar with)?
Also, what's going on here? Is it typical for computing time to grow at a much faster rate than the data being operated on?
Thank you.
You can also use rowTabulates() from the matrixStats package:
# Dimensions [500000, 5]
mat <- matrix(sampleNames[round(runif(2500000, 1, 3))], ncol = 5)
library(matrixStats)
startTime <- Sys.time()
mat1 <- matrix(match(mat, sampleNames), ncol = 5)  # map names to integer codes
b <- rowSums(rowTabulates(mat1) != 0)              # count nonzero bins per row
Sys.time() - startTime
# Time difference of 0.2012889 secs
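The trick is that rowTabulates() works on integer (not character) matrices, so match() first maps each name to its position in sampleNames; counting the nonzero bins in each row then gives the number of distinct names. A minimal illustration on a toy 2 x 3 matrix (made-up data, not from the question):
m <- matrix(c(1L, 2L, 2L, 1L, 1L, 3L), ncol = 3)
rowTabulates(m)                 # per-row counts of each integer value
rowSums(rowTabulates(m) != 0)   # distinct values per row: 2 3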
For comparison, the apply() approach by @Richard Scriven:
startTime <- Sys.time()
a <- apply(mat, 1, function(x) length(unique(x)))
Sys.time() - startTime
# Time difference of 4.231503 secs
all.equal(a, b)
# [1] TRUE
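As for why the runtime grows so much faster than the data: one likely contributor is that NamesPerRow starts empty and is grown by assignment on every iteration, which can force repeated reallocation and copying of the result vector as it gets longer. A minimal sketch of the original loop that only changes this one thing, preallocating the result up front:
NamesPerRow <- integer(nrow(mat))   # preallocate instead of growing
startTime <- Sys.time()
for (i in seq_len(nrow(mat))) {
  NamesPerRow[i] <- length(unique(mat[i, ]))
}
Sys.time() - startTime
Even with preallocation, a row-by-row loop will remain far slower than the vectorized rowTabulates() approach above, so treat this as an explanation of the super-linear scaling rather than the recommended fix. Multi-threading (e.g. splitting rows across cores with the parallel package) is also possible, but the vectorized version is simpler and already fast.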