
Creating an occurrence matrix for each char from a large number of equal length strings

Tags: string, r, matrix

I am trying to create a matrix which gives me the occurrence of each element at each position, based on a large number of strings in a vector.

I have the following pet example and potential solution:

set.seed(42)
seqs <- sapply(1:10, FUN = function(x) { paste(sample(LETTERS, size = 11, replace = TRUE), collapse = "") })
test <- lapply(seqs, FUN = function(s) {
  do.call(cbind, lapply(LETTERS, FUN = function(ch) {
    grepl(ch, unlist(strsplit(s, split="")))
  }))
})

testR <- Reduce("+", test)
seqs
# [1] "XYHVQNTDRSL" "SYGMYZDMOXD" "ZYCNKXLVTVK" "RAVAFXPJLAZ" "LYXQZQIJKUB" "TREGNRZTOWE" "HVSGBDFMFSA" "JNAPEJQUOGC" "CHRAFYYTINT"
#[10] "QQFFKYZTTNA"
testR
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
 [1,]    0    0    1    0    0    0    0    1    0     1     0     1     0     0     0     0     1     1     1     1     0     0     0
 [2,]    1    0    0    0    0    0    0    1    0     0     0     0     0     1     0     0     1     1     0     0     0     1     0
 [3,]    1    0    1    0    1    1    1    1    0     0     0     0     0     0     0     0     0     1     1     0     0     1     0
 [4,]    2    0    0    0    0    1    2    0    0     0     0     0     1     1     0     1     1     0     0     0     0     1     0
 [5,]    0    1    0    0    1    2    0    0    0     0     2     0     0     1     0     0     1     0     0     0     0     0     0
 [6,]    0    0    0    1    0    0    0    0    0     1     0     0     0     1     0     0     1     1     0     0     0     0     0
 [7,]    0    0    0    1    0    1    0    0    1     0     0     1     0     0     0     1     1     0     0     1     0     0     0
 [8,]    0    0    0    1    0    0    0    0    0     2     0     0     2     0     0     0     0     0     0     3     1     1     0
 [9,]    0    0    0    0    0    1    0    0    1     0     1     1     0     0     3     0     0     1     0     2     0     0     0
[10,]    1    0    0    0    0    0    1    0    0     0     0     0     0     2     0     0     0     0     2     0     1     1     1
[11,]    2    1    1    1    1    0    0    0    0     0     1     1     0     0     0     0     0     0     0     1     0     0     0
      [,24] [,25] [,26]
 [1,]     1     0     1
 [2,]     0     4     0
 [3,]     1     0     0
 [4,]     0     0     0
 [5,]     0     1     1
 [6,]     2     2     1
 [7,]     0     1     2
 [8,]     0     0     0
 [9,]     0     0     0
[10,]     1     0     0
[11,]     0     0     1
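One small readability improvement on the approach above (a sketch, not part of the original code): give the count matrix dimnames so that rows are labelled with positions and columns with letters, which makes debugging against real data much easier.

```r
set.seed(42)
seqs <- sapply(1:10, function(x) paste(sample(LETTERS, size = 11, replace = TRUE), collapse = ""))
test <- lapply(seqs, function(s) {
  do.call(cbind, lapply(LETTERS, function(ch) grepl(ch, strsplit(s, "")[[1]])))
})
testR <- Reduce("+", test)
# Label rows with string positions and columns with letters
dimnames(testR) <- list(position = 1:11, letter = LETTERS)
```

With the labels in place, a cell can be read off directly, e.g. `testR["1", "X"]` is the number of strings with an "X" at position 1.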

I am trying to force myself not to use loops but to rely on vectorised functions instead. However, I am not sure whether my solution is actually a good (efficient) one, or whether I have confused myself somewhere. It is also fairly difficult to debug when real-life data misbehaves (which sadly is the case).

So my question, what is a good way to solve this problem?

EDIT: Following 989's lead, I have benchmarked the proposed solutions, with a data size more representative of the problem at hand.

library(microbenchmark)
set.seed(42)
seqs <- sapply(1:10000, FUN = function(x) { paste(sample(LETTERS, size = 31, replace = TRUE), collapse = "") })

f.posdef=function(){
  test <- lapply(seqs, FUN = function(s) {
    do.call(cbind, lapply(LETTERS, FUN = function(ch) {
      grepl(ch, unlist(strsplit(s, split="")))
    }))
  })
  (testR <- Reduce("+", test))
}

f.989=function() {
  l <- lapply(seqs, function(x) {
    m <- matrix(0, nchar(x), 26)
    replace(m, cbind(seq(nchar(x)),  match(strsplit(x, "")[[1]], LETTERS)), 1)
  })
  Reduce("+",l)
}

f.docendo1=function()
  t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx) 
    table(factor(xx, levels = LETTERS), 1:31))))

f.docendo2=function()
  t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:31, 10000)))

f.akrun=function(){
  strsplit(seqs, "") %>% 
    transpose %>% 
    map(unlist) %>% 
    setNames(seq_len(nchar(seqs[1]))) %>% 
    stack %>%
    select(2:1) %>% 
    table
}

r <- f.posdef()
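As an aside, the core of f.989 is R's matrix-indexing form of replace: a two-column (row, column) index matrix selects exactly the cells to set to 1, one per character position. A minimal standalone illustration (the string "ABCA" is just an example):

```r
x <- "ABCA"
m <- matrix(0, nchar(x), 26)
# Each row of idx is a (position, letter-index) pair to flip to 1
idx <- cbind(seq_len(nchar(x)), match(strsplit(x, "")[[1]], LETTERS))
m2 <- replace(m, idx, 1)
```

Here m2[1, 1] and m2[4, 1] are both 1, since "A" occurs at positions 1 and 4.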

Note that the main difference between this benchmark and 989's is the sample size.

> all(r==f.989())
[1] TRUE
> all(r==f.docendo1())
[1] TRUE
> all(r==f.docendo2())
[1] TRUE
> all(r==f.akrun())
[1] FALSE
> res <- microbenchmark(f.posdef(), f.989(), f.docendo1(), f.docendo2(), f.akrun())
> autoplot(res)

[benchmark plot]

As the plot shows, akrun's solution is blazing fast but seemingly inaccurate. Thus the gold medal goes to docendo's second solution. However, it is probably worth noting that both of docendo's solutions, as well as 989's suggestion, make assumptions about the length/number of the sample strings or about the alphabet size (the 26 in m <- matrix(0, nchar(x), 26)).

In the case of the size/length of the sample strings (i.e. seqs), removing the assumption would just be an additional call to nchar, which should not impact the runtime much. I am not sure how one would avoid the assumption about the alphabet size if it is not known a priori.
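One way to drop the a priori alphabet assumption (a sketch based on docendo's first solution, not from the original post) is to derive the alphabet from the data itself. The trade-off is that the result only has columns for letters that actually occur in the sample.

```r
set.seed(42)
seqs <- sapply(1:10, function(x) paste(sample(LETTERS, size = 11, replace = TRUE), collapse = ""))
# Derive the alphabet and string width from the data instead of assuming them
chars <- strsplit(seqs, "", fixed = TRUE)
alphabet <- sort(unique(unlist(chars)))
width <- nchar(seqs[1])
counts <- t(Reduce("+", lapply(chars, function(xx)
  table(factor(xx, levels = alphabet), 1:width))))
```

Each row of counts is a position and each column a letter seen somewhere in seqs, so every row sums to length(seqs).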

asked Dec 24 '22 by posdef


2 Answers

Here's another approach in base R which requires less looping than the OP's approach:

t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx) 
                               table(factor(xx, levels = LETTERS), 1:11))))

#    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1

Or, possibly more efficient:

t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:nchar(seqs[1]), length(seqs))))
answered Jan 30 '23 by talat


We can also use table just once:

library(tidyverse)
strsplit(seqs, "") %>% 
       transpose %>% 
       map(unlist) %>% 
       setNames(seq_len(nchar(seqs[1]))) %>% 
       stack %>%
       select(2:1) %>% 
       table
#   values
#ind  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#  1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#  2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#  3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#  4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#  5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#  6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#  7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#  8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#  9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#  10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#  11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1

Or, slightly more compact, using mtabulate from qdapTools:

library(qdapTools)
strsplit(seqs, "") %>% 
         transpose %>% 
         map(unlist) %>%
         mtabulate
#   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#1  0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#2  1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#3  1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#4  2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#5  0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#6  0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#7  0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#8  0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#9  0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
answered Jan 30 '23 by akrun