I am trying to create a matrix which gives me the number of occurrences of each character at each position, based on a large number of strings in a vector.
I have the following pet example and a potential solution:
set.seed(42)
seqs <- sapply(1:10, FUN = function(x) { paste(sample(LETTERS, size = 11, replace = TRUE), collapse = "") })
test <- lapply(seqs, FUN = function(s) {
  do.call(cbind, lapply(LETTERS, FUN = function(ch) {
    grepl(ch, unlist(strsplit(s, split = "")))
  }))
})
testR <- Reduce("+", test)
seqs
# [1] "XYHVQNTDRSL" "SYGMYZDMOXD" "ZYCNKXLVTVK" "RAVAFXPJLAZ" "LYXQZQIJKUB" "TREGNRZTOWE" "HVSGBDFMFSA" "JNAPEJQUOGC" "CHRAFYYTINT"
#[10] "QQFFKYZTTNA"
testR
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
[1,] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
[2,] 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0
[3,] 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
[4,] 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0
[5,] 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0
[6,] 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0
[7,] 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0
[8,] 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0
[9,] 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0
[10,] 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1
[11,] 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
[,24] [,25] [,26]
[1,] 1 0 1
[2,] 0 4 0
[3,] 1 0 0
[4,] 0 0 0
[5,] 0 1 1
[6,] 2 2 1
[7,] 0 1 2
[8,] 0 0 0
[9,] 0 0 0
[10,] 1 0 0
[11,] 0 0 1
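As a quick sanity check on the solution above (a sketch; it reuses test and testR from the code at the top): each element of test is an 11 x 26 logical matrix with exactly one TRUE per row, because every character matches exactly one letter, so after summing, every row of testR must add up to the number of strings:

```r
# Each per-string matrix is positions x letters (11 x 26 here)
stopifnot(all(dim(test[[1]]) == c(11, 26)))
# Every position holds exactly one letter per string, so each row sums to 10
stopifnot(all(rowSums(testR) == length(seqs)))
```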
I am trying to force myself not to use loops but instead use vectorised functions, but I am not sure whether my solution is actually a good (efficient) solution or whether I have confused myself somewhere. It is also fairly difficult to debug when real-life data misbehaves (which, sadly, is the case).
So my question: what is a good way to solve this problem?
EDIT: Following 989's track, I have done a benchmark test of the proposed solutions here, with a data size more representative of the problem at hand.
library(microbenchmark)
set.seed(42)
seqs <- sapply(1:10000, FUN = function(x) { paste(sample(LETTERS, size = 31, replace = T), collapse = "") })
f.posdef = function() {
  test <- lapply(seqs, FUN = function(s) {
    do.call(cbind, lapply(LETTERS, FUN = function(ch) {
      grepl(ch, unlist(strsplit(s, split = "")))
    }))
  })
  (testR <- Reduce("+", test))
}
f.989 = function() {
  l <- lapply(seqs, function(x) {
    m <- matrix(0, nchar(x), 26)
    replace(m, cbind(seq(nchar(x)), match(strsplit(x, "")[[1]], LETTERS)), 1)
  })
  Reduce("+", l)
}
f.docendo1 = function()
  t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
    table(factor(xx, levels = LETTERS), 1:31))))
f.docendo2 = function()
  t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:31, 10000)))
f.akrun = function() {
  # requires purrr and dplyr (loaded via library(tidyverse) in akrun's answer below)
  strsplit(seqs, "") %>%
    transpose %>%
    map(unlist) %>%
    setNames(seq_len(nchar(seqs[1]))) %>%
    stack %>%
    select(2:1) %>%
    table
}
r <- f.posdef()
Note that the main difference between this benchmark and 989's is the sample size.
> all(r==f.989())
[1] TRUE
> all(r==f.docendo1())
[1] TRUE
> all(r==f.docendo2())
[1] TRUE
> all(r==f.akrun())
[1] FALSE
> res <- microbenchmark(f.posdef(), f.989(), f.docendo1(), f.docendo2(), f.akrun())
> autoplot(res)
As the plot shows, akrun's solution is blazing fast but seemingly inaccurate. Thus the gold medal goes to docendo's second solution. However, it is probably worth noting that both of docendo's solutions, as well as 989's suggestion, make assumptions about the length of the sample strings or the alphabet size, e.g. in m <- matrix(0, nchar(x), 26).
In the case of the length of the sample strings (i.e. of seqs), dropping the assumption costs only an additional call to nchar, which should not impact the runtime much. I am not sure how one would avoid the assumption about the alphabet size if it is not known a priori.
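If the alphabet really is unknown a priori, one workaround (a sketch only, not benchmarked; alphabet and n are names introduced here for illustration) is to derive it from the data itself before tabulating, keeping docendo's first approach otherwise intact:

```r
# Derive the alphabet and the string length from the data instead of
# hard-coding LETTERS and 26 (assumes all strings have the same length)
chars    <- strsplit(seqs, "", fixed = TRUE)
alphabet <- sort(unique(unlist(chars)))   # stands in for LETTERS
n        <- nchar(seqs[1])                # stands in for 11 or 31
counts <- t(Reduce("+", lapply(chars, function(xx)
  table(factor(xx, levels = alphabet), 1:n))))
```

Note that characters never observed anywhere in seqs will not get a column, unlike the fixed-LETTERS version.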
Here's another approach in base R which requires less looping than the approach by OP:
t(Reduce("+", lapply(strsplit(seqs, "", fixed = TRUE), function(xx)
  table(factor(xx, levels = LETTERS), 1:11))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or, possibly more efficient:
t(table(do.call(cbind, strsplit(seqs, "", fixed = TRUE)), rep(1:nchar(seqs[1]), length(seqs))))
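To see why the one-liner above works (a small illustration on toy data, assuming as before that all strings share one length): do.call(cbind, strsplit(...)) builds an nchar-by-length(seqs) character matrix, and pairing its elements (read column-major, i.e. string by string) with a repeated position index lets table count characters per position:

```r
seqs_toy <- c("AB", "BA", "AA")   # toy data for illustration
# 2 x 3 character matrix: column j holds the characters of string j
chars <- do.call(cbind, strsplit(seqs_toy, "", fixed = TRUE))
# position index for each character, column-major: 1 2 1 2 1 2
pos <- rep(1:nchar(seqs_toy[1]), length(seqs_toy))
t(table(chars, pos))
# rows are positions 1..2, columns the observed letters;
# e.g. "A" occurs twice at position 1 ("AB", "AA") and once comes "B" ("BA")
```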
We can also use table once:
library(tidyverse)
strsplit(seqs, "") %>%
  transpose %>%
  map(unlist) %>%
  setNames(seq_len(nchar(seqs[1]))) %>%
  stack %>%
  select(2:1) %>%
  table
# values
#ind A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
# 2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
# 3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
# 4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
# 5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
# 6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
# 7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
# 8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
# 9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
# 10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
# 11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1
Or, slightly more compact, by using mtabulate from qdapTools:
library(qdapTools)
strsplit(seqs, "") %>%
  transpose %>%
  map(unlist) %>%
  mtabulate
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
#1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1
#2 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 4 0
#3 1 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0
#4 2 0 0 0 0 1 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0
#5 0 1 0 0 1 2 0 0 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1
#6 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 2 2 1
#7 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 2
#8 0 0 0 1 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 3 1 1 0 0 0 0
#9 0 0 0 0 0 1 0 0 1 0 1 1 0 0 3 0 0 1 0 2 0 0 0 0 0 0
#10 1 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 0 1 1 1 1 0 0
#11 2 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1