I have a set of strings which contain space-separated elements. I want to build a matrix which will tell me which elements were part of which strings. For example:
""
"A B C"
"D"
"B D"
Should give something like:
A B C D
1
2 1 1 1
3 1
4 1 1
Now I've got a solution, but it runs slow as molasse, and I've run out of ideas on how to make it faster:
reverseIn <- function(vector, value) {
return(value %in% vector)
}
buildCategoryMatrix <- function(valueVector) {
allClasses <- c()
for(classVec in unique(valueVector)) {
allClasses <- unique(c(allClasses,
strsplit(classVec, " ", fixed=TRUE)[[1]]))
}
resMatrix <- matrix(ncol=0, nrow=length(valueVector))
splitValues <- strsplit(valueVector, " ", fixed=TRUE)
for(cat in allClasses) {
if(cat=="") {
catIsPart <- (valueVector == "")
} else {
catIsPart <- sapply(splitValues, reverseIn, cat)
}
resMatrix <- cbind(resMatrix, catIsPart)
}
colnames(resMatrix) <- allClasses
return(resMatrix)
}
Profiling the function gives me this:
$by.self
self.time self.pct total.time total.pct
"match" 31.20 34.74 31.24 34.79
"FUN" 30.26 33.70 74.30 82.74
"lapply" 13.56 15.10 87.86 97.84
"%in%" 12.92 14.39 44.10 49.11
So my actual questions would be: - Where are the 33% spent in "FUN" coming from? - Would there be any way to speed up the %in% call?
I tried turning the strings into factors prior to going into the loop so that I'd be matching numbers instead of strings, but that actually makes R crash. I've also tried going for partial matrix assignment (IE, resMatrix[i,x] <- 1) where i is the number of the string and x is the vector of factors. No dice there either, as it seems to keep on running infinitely.
In the development version of my "splitstackshape" package, there's a helper function called charBinaryMat
that can be used for something like this:
Here's the function (since the version of the package on CRAN doesn't have it yet):
charBinaryMat <- function(listOfValues, fill = NA) {
lev <- sort(unique(unlist(listOfValues, use.names = FALSE)))
m <- matrix(fill, nrow = length(listOfValues), ncol = length(lev))
colnames(m) <- lev
for (i in 1:nrow(m)) {
m[i, listOfValues[[i]]] <- 1
}
m
}
The input is expected to be the output of strsplit
:
And here it is in use:
str <- c("" , "A B C" , "D" , "B D" )
## Fill is `NA` by default
charBinaryMat(strsplit(str, " ", fixed=TRUE))
# A B C D
# [1,] NA NA NA NA
# [2,] 1 1 1 NA
# [3,] NA NA NA 1
# [4,] NA 1 NA 1
## Can easily be set to another value
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
# A B C D
# [1,] 0 0 0 0
# [2,] 1 1 1 0
# [3,] 0 0 0 1
# [4,] 0 1 0 1
Since your question is about a faster approach, let's benchmark.
The functions for benchmarking:
CBM <- function() {
charBinaryMat(strsplit(str, " ", fixed=TRUE), fill = 0)
}
BCM <- function() {
buildCategoryMatrix(str)*1L
}
Sapply <- function() {
y <- unique( unlist( strsplit( str , " " ) ) )
out <- t(sapply(str, function(x) y %in% unlist(strsplit(x , " " )),
USE.NAMES = FALSE )) * 1L
colnames(out) <- y
out
}
Some sample data:
set.seed(1)
A = sample(10, 100000, replace = TRUE)
str <- sapply(seq_along(A), function(x)
paste(sample(LETTERS[1:10], A[x]), collapse = " "))
head(str)
# [1] "H G C" "F H J G" "H D J A I B"
# [4] "A C F H J B E G D I" "F C H" "I C G B J D F A E"
Some sample output:
## Automatically sorted
head(CBM())
# A B C D E F G H I J
# [1,] 0 0 1 0 0 0 1 1 0 0
# [2,] 0 0 0 0 0 1 1 1 0 1
# [3,] 1 1 0 1 0 0 0 1 1 1
# [4,] 1 1 1 1 1 1 1 1 1 1
# [5,] 0 0 1 0 0 1 0 1 0 0
# [6,] 1 1 1 1 1 1 1 0 1 1
## Sorting just for comparison
head(BCM())[, LETTERS[1:10]]
# A B C D E F G H I J
# [1,] 0 0 1 0 0 0 1 1 0 0
# [2,] 0 0 0 0 0 1 1 1 0 1
# [3,] 1 1 0 1 0 0 0 1 1 1
# [4,] 1 1 1 1 1 1 1 1 1 1
# [5,] 0 0 1 0 0 1 0 1 0 0
# [6,] 1 1 1 1 1 1 1 0 1 1
## Sorting just for comparison
head(Sapply())[, LETTERS[1:10]]
# A B C D E F G H I J
# [1,] 0 0 1 0 0 0 1 1 0 0
# [2,] 0 0 0 0 0 1 1 1 0 1
# [3,] 1 1 0 1 0 0 0 1 1 1
# [4,] 1 1 1 1 1 1 1 1 1 1
# [5,] 0 0 1 0 0 1 0 1 0 0
# [6,] 1 1 1 1 1 1 1 0 1 1
Benchmarking:
library(microbenchmark)
microbenchmark(CBM(), BCM(), Sapply(), times=20)
# Unit: milliseconds
# expr min lq median uq max neval
# CBM() 675.0929 718.3454 777.2423 805.3872 858.6609 20
# BCM() 11059.6305 11267.9888 11367.3283 11595.1758 11792.5950 20
# Sapply() 3536.7755 3687.0308 3759.7388 3813.4233 3968.3192 20
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With