
regex match on R gregexpr

Tags: string, regex, r

I'm trying to count the instances of 3 consecutive "a" events, "aaa".

The string consists of lowercase letters, e.g. "abaaaababaaa".

I tried the following piece of code, but its behavior is not quite what I am looking for.

x<-"abaaaababaaa";
gregexpr("aaa",x);

I would like the match to return 3 instances of the "aaa" occurrence as opposed to 2.

Assuming indexing starts at 1:

  • The first occurrence of "aaa" is at index 3.
  • The second occurrence of "aaa" is at index 4. (this is not caught by gregexpr)
  • The third occurrence of "aaa" is at index 10.
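
For reference, a minimal sketch of what the call above reports. The variable name `non_overlapping` is only illustrative. By default `gregexpr` resumes searching after the end of the previous match, so the occurrence starting at index 4 is skipped.

```r
x <- "abaaaababaaa"

# Default matching consumes each match, so overlapping hits are not reported.
m <- gregexpr("aaa", x)[[1]]

# Drop the match.length attributes and keep only the start positions.
non_overlapping <- as.integer(m)
print(non_overlapping)   # start positions of the two non-overlapping matches
```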
Aditya Sihag asked Jan 22 '13 04:01


2 Answers

To catch the overlapping matches, you can use a lookahead like this:

gregexpr("a(?=aa)", x, perl=TRUE)

However, your matches are now just a single "a", so it might complicate further processing of these matches, especially if you're not always looking for fixed-length patterns.
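
As a rough sketch of how the lookahead result could be post-processed: the start positions come straight from `gregexpr`, and because the pattern here has a fixed length of 3, the full overlapping substrings can be recovered with `substring`. The names `starts`, `n_matches` and `full_matches` are assumptions made for illustration, and the fixed width of 3 only applies to this particular pattern.

```r
x <- "abaaaababaaa"

# The lookahead (?=aa) checks for two more "a"s without consuming them,
# so the next search can begin one character later.
m <- gregexpr("a(?=aa)", x, perl = TRUE)[[1]]

# Keep the start positions only; gregexpr returns -1 when nothing matches.
starts <- as.integer(m)
starts <- starts[starts > 0]

n_matches <- length(starts)

# Assumption: the target is always 3 characters wide ("aaa").
full_matches <- substring(x, starts, starts + 2L)

print(starts)
print(n_matches)
print(full_matches)
```

For patterns whose length varies, the `substring` step no longer applies as written, which is the complication mentioned above.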

Marius answered Nov 15 '22 03:11


I know I'm late, but I wanted to share this solution:

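your.string <- "abaaaababaaa"
# number of 3-character windows (the last window starts at nchar - 2)
n.windows <- nchar(your.string) - 2
# split the string into single characters
x <- unlist(strsplit(your.string, NULL))
x2 <- character(0)
# build every 3-character window; x2[i] is the window starting at index i
for (i in seq_len(n.windows))
  x2 <- c(x2, paste(x[i], x[i+1], x[i+2], sep=""))
# indices of the windows equal to "aaa"
hits <- grep("aaa", x2)
cat("occurrences of <aaa> in <your.string> is,",
    length(hits), "and they are at index", hits)
> occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10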

Heavily inspired by this answer from R-help by Fran.
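
The same sliding-window idea can also be written without an explicit loop. This is only a sketch: `find_overlapping` is a hypothetical helper name, and it assumes a fixed, literal pattern rather than a regular expression.

```r
# Hypothetical helper: start indices of all (possibly overlapping)
# occurrences of a literal pattern in a single string.
find_overlapping <- function(s, pattern) {
  k <- nchar(pattern)
  n_windows <- nchar(s) - k + 1L
  # A string shorter than the pattern cannot contain it.
  if (n_windows < 1L) return(integer(0))
  starts <- seq_len(n_windows)
  # substring() is vectorised, so this extracts every window of width k at once.
  windows <- substring(s, starts, starts + k - 1L)
  which(windows == pattern)
}

hits <- find_overlapping("abaaaababaaa", "aaa")
print(hits)
print(length(hits))
```

Comparing windows with `==` rather than `grep` keeps the pattern literal, so characters that are special in regular expressions need no escaping.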

Eric Fail answered Nov 15 '22 03:11