I am trying to create a matrix which gives me the number of occurrences of each character at each position, based on a large number of strings in a vector.
I have the following pet example and a potential solution:
set.seed(42)
seqs <- sapply(1:10, FUN = function(x) { paste(sample(LETTERS, size = 11, replace = TRUE), collapse = "") })
test <- lapply(seqs, FUN = function(s) {
  do.call(cbind, lapply(LETTERS, FUN = function(ch) {
    grepl(ch, unlist(strsplit(s, split = "")))
  }))
})
testR <- Reduce("+", test)
seqs
# [1] "XYHVQNTDRSL" "SYGMYZDMOXD" "ZYCNKXLVTVK" "RAVAFXPJLAZ" "LYXQZQIJKUB" "TREGNRZTOWE" "HVSGBDFMFSA" "JNAPEJQUOGC" "CHRAFYYTINT"
#[10] "QQFFKYZTTNA"
testR
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
[1,] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
[2,] 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0
[3,] 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
[4,] 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0
[5,] 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0
[6,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0
[7,] 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0
[8,] 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0
[9,] 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0
[10,] 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1
[11,] 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
[,24] [,25] [,26]
[1,] 1 0 1
[2,] 0 4 0
[3,] 1 0 0
[4,] 0 0 0
[5,] 0 1 1
[6,] 2 2 1
[7,] 0 1 2
[8,] 0 0 0
[9,] 0 0 0
[10,] 1 0 0
[11,] 0 0 1
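As a quick sanity check on the solution above (a sketch; it reuses test and testR from the code at the top): each element of test is an 11 x 26 logical matrix with exactly one TRUE per row, because every character matches exactly one letter, so after summing, every row of testR must add up to the number of strings:

```r
# Each per-string matrix is positions x letters (11 x 26 here)
stopifnot(all(dim(test[[1]]) == c(11, 26)))
# Every position holds exactly one letter per string, so each row sums to 10
stopifnot(all(rowSums(testR) == length(seqs)))
```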
I am trying to force myself not to use loops but instead use vectorised functions, but I am not sure whether my solution is actually a good (efficient) solution or whether I have confused myself somewhere. It is also fairly difficult to debug when real-life data misbehaves (which, sadly, is the case).
So my question: what is a good way to solve this problem?
EDIT: Following 989's track, I have done a benchmark test of the proposed solutions here, with a data size more representative of the problem at hand.
library(microbenchmark)
set.seed(42)
seqs <- sapply(1:10000, FUN = function(x) { paste(sample(LETTERS, size = 31, replace = T), collapse = "") })
f.posdef = function() {
  test <- lapply(seqs, FUN = function(s) {
    do.call(cbind, lapply(LETTERS, FUN = function(ch) {
      grepl(ch, unlist(strsplit(s, split = "")))
    }))
  })
  (testR <- Reduce("+", test))
}
f.989 = function() {
  l <- lapply(seqs, function(x) {
    m <- matrix(0, nchar(x), 26)
    replace(m, cbind(seq(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
  })
  Reduce("+", l)
}
f.docendo1 = function()
  t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
    table(factor(xx, levels = LETTERS), 1:31))))
f.docendo2 = function()
  t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:31, 10000)))
f.akrun = function() {
  # requires purrr and dplyr (loaded via library(tidyverse) in akrun's answer below)
  strsplit(seqs, "") %>%
    transpose %>%
    map(unlist) %>%
    setNames(seq_len(nchar(seqs[1]))) %>%
    stack %>%
    select(2:1) %>%
    table
}
r <- f.posdef()
Note that the main difference between this benchmark and 989's is the sample size.
> all(r==f.989())
[1] TRUE
> all(r==f.docendo1())
[1] TRUE
> all(r==f.docendo2())
[1] TRUE
> all(r==f.akrun())
[1] FALSE
> res <- microbenchmark(f.posdef(), f.989(), f.docendo1(), f.docendo2(), f.akrun())
> autoplot(res)
As the plot shows, akrun's solution is blazing fast but seemingly inaccurate. Thus the gold medal goes to docendo's second solution. However, it is probably worth noting that both of docendo's solutions, as well as 989's suggestion, make assumptions about the length of the sample strings or the alphabet size, e.g. in m <- matrix(0, nchar(x), 26).
In the case of the length of the sample strings (i.e. of seqs), dropping the assumption costs only an additional call to nchar, which should not impact the runtime much. I am not sure how one would avoid the assumption about the alphabet size if it is not known a priori.
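If the alphabet really is unknown a priori, one workaround (a sketch only, not benchmarked; alphabet and n are names introduced here for illustration) is to derive it from the data itself before tabulating, keeping docendo's first approach otherwise intact:

```r
# Derive the alphabet and the string length from the data instead of
# hard-coding LETTERS and 26 (assumes all strings have the same length)
chars    <- strsplit(seqs, "", fixed = TRUE)
alphabet <- sort(unique(unlist(chars)))   # stands in for LETTERS
n        <- nchar(seqs[1])                # stands in for 11 or 31
counts <- t(Reduce("+", lapply(chars, function(xx)
  table(factor(xx, levels = alphabet), 1:n))))
```

Note that characters never observed anywhere in seqs will not get a column, unlike the fixed-LETTERS version.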
Here's another approach in base R which requires less looping than the approach by OP:
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
  table(factor(xx, levels = LETTERS), 1:11))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or, possibly more efficient:
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:nchar(seqs[1]), length(seqs))))
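To see why the one-liner above works (a small illustration on toy data, assuming as before that all strings share one length): do.call(cbind, strsplit(...)) builds an nchar-by-length(seqs) character matrix, and pairing its elements (read column-major, i.e. string by string) with a repeated position index lets table count characters per position:

```r
seqs_toy <- c("AB", "BA", "AA")   # toy data for illustration
# 2 x 3 character matrix: column j holds the characters of string j
chars <- do.call(cbind, strsplit(seqs_toy, "", fixed = TRUE))
# position index for each character, column-major: 1 2 1 2 1 2
pos <- rep(1:nchar(seqs_toy[1]), length(seqs_toy))
t(table(chars, pos))
# rows are positions 1..2, columns the observed letters;
# e.g. "A" occurs twice at position 1 ("AB", "AA") and once comes "B" ("BA")
```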
We can also use table once:
library(tidyverse)
strsplit(seqs, "") %>%
  transpose %>%
  map(unlist) %>%
  setNames(seq_len(nchar(seqs[1]))) %>%
  stack %>%
  select(2:1) %>%
  table
# values
#ind A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or, slightly more compact, by using mtabulate from qdapTools:
library(qdapTools)
strsplit(seqs, "") %>%
  transpose %>%
  map(unlist) %>%
  mtabulate
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1