Is there an R function for counting the occurrence of a given substring within a string?

Tags:

r

I know that I can use stringr::str_count() to count the occurrences of a substring. However, this function doesn't suit my needs. To be more concrete, let's say I have the string "MSAGARRRPR" and I want to count the number of times that the substring "RR" appears.

stringr::str_count(string = "MSAGARRRPR", pattern = "RR")

will return the number 1. However, in the current example, I'm interested in counting the number of times that an "R" is followed by another "R", and that happens twice.

I have written a function to count it:

occurrences <- function(string, pattern){
     n <- nchar(pattern)
     number_pieces <- (nchar(string) - (n - 1))
     pieces <- character(number_pieces)
     # Extract every window of length n, one start position at a time
     for (i in seq_len(number_pieces)){
        pieces[i] <- substring(string, first = i, last = i + (n - 1))
     }
     # Count the windows that equal the pattern
     output <- sum(pieces == pattern)
     return(output)
    }

Now, occurrences(string = "MSAGARRRPR", pattern = "RR") returns the expected answer: 2

Nevertheless, I'm wondering whether there is a more efficient R function to compute it.

Thanks in advance!

eib asked Jan 31 '21 11:01



2 Answers

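You can use a lookbehind or lookahead regex. str_count() counts non-overlapping matches, so "RR" is found only once in "RRR"; a lookaround does not consume the character it inspects, so overlapping occurrences are each counted.

With positive lookbehind: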

stringr::str_count(string = "MSAGARRRPR", pattern = "(?<=R)R")
#[1] 2

stringr::str_count(string = "MSAGARRRPRR", pattern = "(?<=R)R")
#[1] 3

This can also be written with a positive lookahead:

stringr::str_count(string = "MSAGARRRPR", pattern = "R(?=R)")
#[1] 2

stringr::str_count(string = "MSAGARRRPRR", pattern = "R(?=R)")
#[1] 3
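
The same lookahead idea can be extended to patterns longer than one character by wrapping the whole pattern in a zero-width lookahead. The following is only a sketch in base R: the helper name count_overlapping is hypothetical (not part of any package), and it assumes that the pattern contains no regex metacharacters, since it is pasted into a regular expression unescaped.

```r
# Hypothetical helper: count overlapping occurrences of `pattern` in `string`.
# Assumption: `pattern` has no regex metacharacters (it is not escaped here).
count_overlapping <- function(string, pattern) {
  # "(?=RR)" matches the empty position in front of every "RR",
  # so occurrences that share characters are each found.
  hits <- gregexpr(paste0("(?=", pattern, ")"), string, perl = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches; real positions are >= 1.
  sum(hits > 0)
}

count_overlapping("MSAGARRRPR", "RR")
## [1] 2
count_overlapping("MSAGARRRPRR", "RR")
## [1] 3
```

Because the lookahead consumes nothing, each start position is tested independently, which is what makes the count overlap-aware.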
Ronak Shah answered Nov 14 '22 11:11



The solutions below do not use regular expressions (if we regard the fixed match in (3) as not being a regular expression). (1) generalizes more easily to windows longer than 2, while (2) and (3) use no packages.

1) rollapply Split the input x into a vector of single characters xs and then apply a moving window of length 2 comparing each such window to c("R", "R") returning a logical vector. Sum the number of TRUE values in it.

library(zoo)

x <- "MSAGARRRPR"
k <- 2

xs <- unlist(strsplit(x, ""))
sum(rollapply(xs, k, identical, rep("R", k)))
## [1] 2

2) head/tail We could also do it using only base R. xs is from (1).

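# Count positions where a character and its successor are both "R"
# (comparing head and tail alone would count any repeated character)
sum(head(xs, -1) == "R" & tail(xs, -1) == "R")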
## [1] 2
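
For patterns longer than two characters, the shifted comparison can be replaced by a vectorized call to substring(), which is also a loop-free version of the function in the question. This is a minimal sketch in base R; the function name count_fixed_overlapping is made up for illustration, and the pattern is compared literally rather than as a regular expression.

```r
# Hypothetical helper: literal, overlap-aware substring count in base R.
count_fixed_overlapping <- function(string, pattern) {
  n <- nchar(pattern)
  last_start <- nchar(string) - n + 1
  # Nothing to count for an empty pattern or a string shorter than the pattern
  if (n == 0 || last_start < 1) return(0L)
  starts <- seq_len(last_start)
  # substring() is vectorized over first/last, so all windows are cut at once
  sum(substring(string, starts, starts + n - 1) == pattern)
}

count_fixed_overlapping("MSAGARRRPR", "RR")
## [1] 2
count_fixed_overlapping("MSAGARRRPR", "RRR")
## [1] 1
```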

3) gregexpr This one uses gregexpr to return the positions of the R characters and then counts the number of times there is a difference of 1 between an R character and the next R character. The fixed = TRUE could be omitted but we included it to ensure that it just matches on R rather than having R be a regular expression. The input x is defined in (1).

sum(diff(unlist(gregexpr("R", x, fixed = TRUE))) == 1)
## [1] 2
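
The position-difference idea can also be stretched to runs of k identical characters by taking lagged differences. The sketch below is an illustration only: count_run_windows is a hypothetical name, it assumes ch is a single character and k is at least 2, and it filters out the -1 that gregexpr() returns when there is no match.

```r
# Hypothetical helper: count windows of k consecutive copies of `ch` in `x`.
count_run_windows <- function(x, ch, k = 2) {
  stopifnot(nchar(ch) == 1, k >= 2)
  pos <- unlist(gregexpr(ch, x, fixed = TRUE))
  pos <- pos[pos > 0]               # drop the -1 that signals "no match"
  if (length(pos) < k) return(0L)
  # Match positions are strictly increasing, so the i-th and (i + k - 1)-th
  # positions differ by exactly k - 1 only when all k matches are adjacent.
  sum(diff(pos, lag = k - 1) == k - 1)
}

count_run_windows("MSAGARRRPR", "R")
## [1] 2
count_run_windows("MSAGARRRPR", "R", k = 3)
## [1] 1
```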
G. Grothendieck answered Nov 14 '22 13:11
