I know that to count the occurrences of a substring I can use stringr::str_count(). However, this function doesn't suit my needs. To be more concrete, let's say I have the string "MSAGARRRPR" and I want to count the number of times the substring "RR" appears.
stringr::str_count(string = "MSAGARRRPR", pattern = "RR")
will return the number 1. However, in the current example, I'm interested in counting the number of times that an "R" is followed by another "R", and that happens twice.
I have written a function to count it:
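occurrences <- function(string, pattern){
     n <- nchar(pattern)
     # number of windows of width n; 0 if the string is shorter than the pattern
     number_pieces <- max(nchar(string) - (n - 1), 0)
     pieces <- character(number_pieces)
     # seq_len() avoids the 1:0 trap when there are no windows
     for (i in seq_len(number_pieces)){
        pieces[i] <- substring(string, first = i, last = i + (n - 1))
     }
     output <- sum(pieces == pattern)
     return(output)
    }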
Now, occurrences(string = "MSAGARRRPR", pattern = "RR") returns the expected answer: 2
Nevertheless, I'm wondering whether there is a more efficient R function to compute it.
Thanks in advance!
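You can use a lookbehind or lookahead regex. Because lookarounds are zero-width, each match consumes only a single "R", so overlapping pairs are counted.
With a positive lookbehind: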
stringr::str_count(string = "MSAGARRRPR", pattern = "(?<=R)R")
#[1] 2
stringr::str_count(string = "MSAGARRRPRR", pattern = "(?<=R)R")
#[1] 3
This can also be written with a positive lookahead:
stringr::str_count(string = "MSAGARRRPR", pattern = "R(?=R)")
#[1] 2
stringr::str_count(string = "MSAGARRRPRR", pattern = "R(?=R)")
#[1] 3
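The same lookahead idea extends to a pattern of any length: wrapping the whole pattern in a zero-width lookahead tests every starting position, so overlapping matches are counted. Below is a minimal base R sketch. The helper name `count_overlapping` is hypothetical, and it assumes the pattern contains no regex metacharacters, since the pattern is pasted into the regex unescaped.

```r
# Hypothetical helper: count overlapping occurrences of a literal pattern.
# Assumption: `pattern` has no regex metacharacters (it is used as-is).
count_overlapping <- function(string, pattern) {
  # Zero-width lookahead: matches at every position where `pattern` starts
  m <- gregexpr(paste0("(?=", pattern, ")"), string, perl = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  sum(m > 0)
}

count_overlapping("MSAGARRRPR", "RR")
#[1] 2
count_overlapping("MSAGARRRPRR", "RR")
#[1] 3
```

Counting positive positions, rather than taking the length of the result, keeps the no-match case at 0.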
The solutions below do not use regular expressions (if we regard the fixed match in (3) as not being a regular expression). (1) generalizes more easily to windows longer than 2, while (2) and (3) use no packages.
1) rollapply Split the input x into a vector of single characters xs and then apply a moving window of length 2 comparing each such window to c("R", "R") returning a logical vector.  Sum the number of TRUE values in it.
library(zoo)
x <- "MSAGARRRPR"
k <- 2
xs <- unlist(strsplit(x, ""))
sum(rollapply(xs, k, identical, rep("R", k)))
## [1] 2
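2) head/tail We could also do it using only base R.  xs is from (1). Note that this compares each character with its successor, so it counts any adjacent repeated character; in this input only R repeats.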
sum(head(xs, -1) == tail(xs, -1))
## [1] 2
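If the input may contain other doubled letters, the comparison can be restricted to R explicitly. A minimal sketch, using a second made-up input `y` (not from the question) to show the difference:

```r
x <- "MSAGARRRPR"
xs <- unlist(strsplit(x, ""))

# TRUE only where a character and its successor are both "R"
n_rr <- sum(head(xs, -1) == "R" & tail(xs, -1) == "R")
n_rr
## [1] 2

# Illustrative input with a doubled "A" as well as a doubled "R"
y <- "AAGARR"
ys <- unlist(strsplit(y, ""))
n_any <- sum(head(ys, -1) == tail(ys, -1))                  # counts "AA" and "RR"
n_r_only <- sum(head(ys, -1) == "R" & tail(ys, -1) == "R")  # counts "RR" only
c(n_any, n_r_only)
## [1] 2 1
```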
3) gregexpr  This one uses gregexpr to return the positions of the R characters and then counts the number of times there is a difference of 1 between an R character and the next R character.  The fixed = TRUE could be omitted but we included it to ensure that it just matches on R rather than having R be a regular expression. The input x is defined in (1).
sum(diff(unlist(gregexpr("R", x, fixed = TRUE))) == 1)
## [1] 2
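For windows longer than 2 without any package or regular expression, one option is to rely on substring() being vectorized over its first and last arguments. This is a sketch rather than part of the answers above, and the helper name `count_windows` is hypothetical:

```r
# Hypothetical helper: count overlapping occurrences, base R only, no regex.
count_windows <- function(string, pattern) {
  n <- nchar(pattern)
  n_starts <- nchar(string) - n + 1
  # No window fits (or the pattern is empty): nothing to count
  if (n == 0 || n_starts < 1) return(0L)
  starts <- seq_len(n_starts)
  # substring() is vectorized, so every window is extracted in one call
  sum(substring(string, starts, starts + n - 1) == pattern)
}

count_windows("MSAGARRRPR", "RR")
## [1] 2
count_windows("MSAGARRRPR", "RRR")
## [1] 1
```

This does the same sliding-window comparison as the loop in the question, just without the explicit loop.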