Count the number of overlapping substrings within a string

Question

Example:

s <- "aaabaabaa"
p <- "aa"

I want to return 4, not 3 (i.e. counting the number of "aa" instances in the initial "aaa" as 2, not 1).

Is there a package that does this, or some other way to count overlapping matches in R?

Ben Bolker · Accepted Answer

I believe that

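find_overlaps <- function(p,s) {
    ## "(?=p)" is a zero-width lookahead: it matches at every position where
    ## p starts without consuming characters, so overlapping hits are all found.
    ## Note that p is interpreted as a regular expression.
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    ## gregexpr returns -1 when there is no match at all
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}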


find_overlaps("aa","aaabaabaa")  ## 4
find_overlaps("not_there","aaabaabaa") ## 0 
find_overlaps("aa","aaaaaaaa")  ## 7

will do what you want, which would be more clearly expressed as "finding the number of overlapping substrings within a string".

This is a minor variation on "Finding the indexes of multiple/overlapping matching substrings".
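Since the answer relies on a lookahead and mentions finding the indexes of matches, here is a minimal sketch of returning the start positions rather than only the count. The name overlap_positions and the literal argument are hypothetical, not part of the original answer. The literal option assumes PCRE's \Q...\E quoting, so that regex metacharacters in p are matched as plain text; it would not cope with a pattern that itself contains \E.

```r
# Hypothetical helper: start positions of all (possibly overlapping) matches.
overlap_positions <- function(p, s, literal = FALSE) {
    # Optionally quote the pattern so characters like "." are taken literally
    if (literal) p <- paste0("\\Q", p, "\\E")
    gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
    # gregexpr signals "no match" with -1
    if (gg[1] == -1) integer(0) else as.integer(gg)
}

overlap_positions("aa", "aaabaabaa")               ## 1 2 5 8
length(overlap_positions("aa", "aaabaabaa"))       ## 4
overlap_positions(".", "a.b..", literal = TRUE)    ## 2 4 5
```

The count from the answer above is then just the length of the returned vector.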

Rich Scriven · Answer

substring might be useful here, by taking every successive pair of characters.

( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4
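The snippet above is tied to two-character patterns. A possible generalization to patterns of any length, still using only substring, might look like the sketch below. The function name count_overlaps_substring is made up for illustration, and the pattern is compared as plain text, not as a regular expression.

```r
# Hypothetical generalization: slide a window of nchar(p) characters over s.
count_overlaps_substring <- function(s, p) {
    n <- nchar(s)
    k <- nchar(p)
    # No window fits if the pattern is empty or longer than the string
    if (k == 0 || k > n) return(0L)
    starts <- seq_len(n - k + 1)
    # substring() is vectorized over its start and end arguments
    sum(substring(s, starts, starts + k - 1) == p)
}

count_overlaps_substring("aaabaabaa", "aa")   ## 4
count_overlaps_substring("aaaaaaaa", "aaa")   ## 6
```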

Mark Miller · Answer

I needed the answer to a related, more general question. Here is what I came up with by generalizing Ben Bolker's solution:

my.data <- read.table(text = '
  my.string   my.cov
     1.2...        1
     .21111        2
     ..2122        3
     ...211        2
     112111        4
     212222        1
', header = TRUE, stringsAsFactors = FALSE)

desired.result.2ch <- read.table(text = '
  my.string   my.cov   n.11   n.12   n.21   n.22
     1.2...        1      0      0      0      0
     .21111        2      3      0      1      0
     ..2122        3      0      1      1      1
     ...211        2      1      0      1      0
     112111        4      3      1      1      0
     212222        1      0      1      1      3
', header = TRUE, stringsAsFactors = FALSE)

desired.result.3ch <- read.table(text = '
  my.string   my.cov   n.111   n.112   n.121   n.122   n.222   n.221   n.212   n.211
     1.2...        1       0       0       0       0       0       0       0       0
     .21111        2       2       0       0       0       0       0       0       1
     ..2122        3       0       0       0       1       0       0       1       0
     ...211        2       0       0       0       0       0       0       0       1
     112111        4       1       1       1       0       0       0       0       1
     212222        1       0       0       0       1       2       0       1       0
', header = TRUE, stringsAsFactors = FALSE)

find_overlaps <- function(s, my.cov, p) {
    gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
    if (length(gg)==1 && gg==-1) 0 else length(gg)
}

p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')

my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))

for(i in seq_along(p)) {
    # Attach the current pattern as a third column so apply() can pass it row by row
    my.data$p <- p[i]
    my.output[i,1] <- p[i]
    # One count per row of my.data (x[1] = my.string, x[2] = my.cov, x[3] = p)
    my.output[i,(2:(nrow(my.data)+1))] <- apply(my.data, 1, function(x) find_overlaps(x[1],  x[2],  x[3]))
}

my.output
desired.result.2ch
desired.result.3ch

pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))

final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output

#  my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1    1.2...      1   0   0   0   0    0    0    0    0    0    0    0    0
#2    .21111      2   3   0   1   0    2    0    0    0    0    0    0    1
#3    ..2122      3   0   1   1   1    0    0    0    1    0    0    1    0
#4    ...211      2   1   0   1   0    0    0    0    0    0    0    0    1
#5    112111      4   3   1   1   0    1    1    1    0    0    0    0    1
#6    212222      1   0   1   1   3    0    0    0    1    2    0    1    0
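For comparison, the same strings-by-patterns table of counts can probably be built more compactly with nested sapply/vapply calls instead of the loop and matrix reshaping. This is only a sketch: the names count_overlaps, strings, patterns and counts are illustrative, and only the two-character patterns are shown.

```r
# Same lookahead idea as in the accepted answer, returning an integer count.
count_overlaps <- function(p, s) {
    gg <- gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]]
    if (gg[1] == -1) 0L else length(gg)
}

strings  <- c("1.2...", ".21111", "..2122", "...211", "112111", "212222")
patterns <- c("11", "12", "21", "22")

# Rows are strings, columns are patterns
counts <- sapply(patterns, function(p)
    vapply(strings, function(s) count_overlaps(p, s), integer(1)))

counts[".21111", ]   ## 11 = 3, 12 = 0, 21 = 1, 22 = 0
counts["212222", ]   ## 11 = 0, 12 = 1, 21 = 1, 22 = 3
```

The resulting matrix can be bound to the original data frame with cbind if the covariate column is needed alongside the counts.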

pgcudahy · Answer

A tidy and, I think, more readable solution is:

library(tidyverse)
PatternCount <- function(text, pattern) {
    #Number of sliding windows; zero when the pattern is longer than the text
    n_windows <- max(0, nchar(text) - nchar(pattern) + 1)
    #Generate all sliding substrings
    map(seq_len(n_windows), 
        function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
    #Test them against the pattern
    map_lgl(function(x) x == pattern) %>%
    #Count the number of matches
    sum
}

PatternCount("aaabaabaa", "aa")
# 4

Tags:

string

r

frashman
