I know that to count the occurrences of a substring I can use str_count(). However, this function doesn't suit my needs. To be more concrete, let's say I have the string "MSAGARRRPR" and I want to count the number of times the substring "RR" appears.
stringr::str_count(string = "MSAGARRRPR", pattern = "RR")
will return 1, because matches are not allowed to overlap. However, in the current example, I'm interested in counting the number of times that an "R" is followed by another "R", and that happens twice.
I have written a function to count it:
occurrences <- function(string, pattern){
n <- nchar(pattern)
number_pieces <- (nchar(string) - (n - 1))
pieces <- character(number_pieces)
for (i in 1:number_pieces){
pieces[i] <- substring(string, first = i, last = i + (n - 1))
}
output <- sum(pieces == pattern)
return(output)
}
Now, occurrences(string = "MSAGARRRPR", pattern = "RR")
returns the expected answer: 2
Nevertheless, I'm wondering whether there is a more efficient R function to compute it.
Thanks in advance!
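You can use a lookbehind or lookahead regex. Lookarounds are zero-width, so each match consumes only a single "R" and overlapping pairs are all counted.
With positive lookbehind: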
stringr::str_count(string = "MSAGARRRPR", pattern = "(?<=R)R")
#[1] 2
stringr::str_count(string = "MSAGARRRPRR", pattern = "(?<=R)R")
#[1] 3
This can also be written with positive lookahead:
stringr::str_count(string = "MSAGARRRPR", pattern = "R(?=R)")
#[1] 2
stringr::str_count(string = "MSAGARRRPRR", pattern = "R(?=R)")
#[1] 3
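The lookahead idea can be extended to a pattern of any length by wrapping the whole pattern in a lookahead, so that every match is zero-width. The following is only a sketch in base R (gregexpr with perl = TRUE) rather than stringr; the helper name count_lookahead is hypothetical, and it assumes the pattern contains no regex metacharacters.

```r
# Hypothetical helper; assumes `pattern` has no regex metacharacters.
count_lookahead <- function(string, pattern) {
  # Wrap the whole pattern in a lookahead so each match is zero-width
  # and overlapping occurrences are found at every start position.
  m <- gregexpr(paste0("(?=", pattern, ")"), string, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match, so count positive positions.
  sum(m > 0)
}

count_lookahead("MSAGARRRPR", "RR")
count_lookahead("MSAGARRRPRR", "RR")
```

For the two example strings this should give the same counts as the lookbehind and lookahead calls above.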
The solutions below do not use regular expressions (if we regard the fixed match in (3) as not being a regular expression). (1) generalizes more easily to windows longer than 2, while (2) and (3) use no packages.
1) rollapply Split the input x into a vector of single characters xs, then apply a moving window of length 2, comparing each window to c("R", "R") to give a logical vector. Sum the TRUE values in it.
library(zoo)
x <- "MSAGARRRPR"
k <- 2
xs <- unlist(strsplit(x, ""))
sum(rollapply(xs, k, identical, rep("R", k)))
## [1] 2
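2) head/tail We could also do it using only base R. xs is from (1). Note that this counts every pair of equal adjacent characters, not only "RR"; for this input the two counts coincide.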
sum(head(xs, -1) == tail(xs, -1))
## [1] 2
3) gregexpr This one uses gregexpr to return the positions of the R characters and then counts how many times consecutive positions differ by 1. The fixed = TRUE could be omitted, but we included it to ensure that "R" is matched literally rather than interpreted as a regular expression. The input x is defined in (1).
sum(diff(unlist(gregexpr("R", x, fixed = TRUE))) == 1)
## [1] 2