Overlapping matches in Regex

Tags: c#, regex, overlap

I can't seem to find an answer to this problem, and I'm wondering if one exists. Simplified example:

Consider a string "nnnn", where I want to find all matches of "nn" - but also those that overlap with each other. So the regex would provide the following 3 matches:

  1. [nn]nn
  2. n[nn]n
  3. nn[nn]

I realize this is not exactly what regexes are meant for, but walking the string and parsing this manually seems like an awful lot of code, considering that in reality the matches would have to be done using a pattern, not a literal string.

asked Nov 26 '08 11:11 by jevakallio



3 Answers

Update 2016:

To get nn, nn, nn, SDJMcHattie proposes in the comments (?=(nn)) (see regex101).

(?=(nn))
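
As a minimal C# sketch of how that pattern could be used (the class and method names here are hypothetical, not from the answer): the lookahead itself matches an empty string, so the text you want has to be read from group 1 rather than from the match value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class OverlapDemo
{
    // Hypothetical helper: returns (index, text) for every overlapping "nn".
    public static List<(int Index, string Text)> FindOverlapping(string input)
    {
        var results = new List<(int Index, string Text)>();
        // Each match is zero-width, so the engine advances by one character
        // between matches; the captured text lives in group 1.
        foreach (Match m in Regex.Matches(input, "(?=(nn))"))
        {
            results.Add((m.Index, m.Groups[1].Value));
        }
        return results;
    }
}
```

Calling OverlapDemo.FindOverlapping("nnnn") should give three entries, at indexes 0, 1 and 2, each with the text "nn".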

Original answer (2008)

A possible solution could be to use a positive lookbehind:

(?<=n)n

It would give you the end position of:

  1. [nn]nn (the n at index 1)
  2. n[nn]n (the n at index 2)
  3. nn[nn] (the n at index 3)
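
A small C# sketch of the lookbehind variant, with hypothetical class and method names. It only reports where each pair ends, so for a real pattern you would not get the matched text back this way.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Hypothetical helper: index of every "n" immediately preceded by another "n".
    public static List<int> EndPositions(string input)
    {
        var positions = new List<int>();
        // The lookbehind may inspect characters already consumed by the
        // previous match, which is why consecutive matches are possible.
        foreach (Match m in Regex.Matches(input, "(?<=n)n"))
        {
            positions.Add(m.Index);
        }
        return positions;
    }
}
```

For "nnnn", EndPositions should return the indexes 1, 2 and 3.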

As mentioned by Timothy Khouri, a positive lookahead is more intuitive (see example)

Rather than his proposal, (?=nn)n, I would prefer the simpler form:

(n)(?=(n))

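The match itself is the first n, captured in group(1), so its index is the start position of each string you want. The lookahead captures the second n in group(2) without consuming it, which leaves that character free to start the next match.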

That is so because:

  • Any valid regular expression can be used inside the lookahead.
  • If it contains capturing parentheses, the backreferences will be saved.

So group(1) and group(2) will capture whatever 'n' represents (even if it is a complicated regex).
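
To illustrate the two groups in C# (a sketch only; the class and method names are made up for this example), each overlapping pair can be rebuilt by joining group 1 with the group captured inside the lookahead:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TwoGroupDemo
{
    // Hypothetical helper: rebuilds each overlapping pair from groups 1 and 2.
    public static List<(int Index, string Text)> FindPairs(string input)
    {
        var pairs = new List<(int Index, string Text)>();
        foreach (Match m in Regex.Matches(input, "(n)(?=(n))"))
        {
            // Group 1 is the consumed first part; group 2 was captured by
            // the lookahead and is still available to the next match.
            pairs.Add((m.Index, m.Groups[1].Value + m.Groups[2].Value));
        }
        return pairs;
    }
}
```

For "nnnn" this should yield "nn" at indexes 0, 1 and 2. The last n does not match on its own because nothing follows it.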


answered Nov 13 '22 22:11 by VonC


Using a lookahead with a capturing group works, at the expense of making your regex slower and more complicated. An alternative solution is to tell the Regex.Match() method where the next match attempt should begin. Try this:

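using System.Text.RegularExpressions;

// subjectString is the text to search, e.g. "nnnn".
Regex regexObj = new Regex("nn");
Match matchObj = regexObj.Match(subjectString);
while (matchObj.Success) {
    // Use matchObj.Index and matchObj.Value here.
    // Resume one character after the start of this match (not after its end).
    matchObj = regexObj.Match(subjectString, matchObj.Index + 1);
}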
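
The loop above could be wrapped in a reusable helper along these lines. This is a sketch under my own assumptions: the OverlapScanner class and FindAll method are hypothetical names, and the guard against an empty match at the end of the input is an addition that the original loop does not have.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class OverlapScanner
{
    // Hypothetical helper: collects one match per start position, allowing overlaps.
    public static List<Match> FindAll(string pattern, string input)
    {
        var regex = new Regex(pattern);
        var found = new List<Match>();
        Match m = regex.Match(input);
        while (m.Success)
        {
            found.Add(m);
            // An empty match at the very end would push the next start
            // index past the end of the string, so stop there.
            if (m.Index + 1 > input.Length) break;
            // Restart one character after the start of the current match.
            m = regex.Match(input, m.Index + 1);
        }
        return found;
    }
}
```

Note that this finds at most one match per start position: with the pattern "n+" on "nnn" it should return "nnn", "nn" and "n", but never the shorter alternatives that begin at the same index.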

answered Nov 13 '22 20:11 by Jan Goyvaerts


AFAIK, there is no pure regex way to do that in one pass (i.e., returning the three captures you request without a loop).

What you can do is find the pattern once, then repeat the search in a loop, each time starting at an offset of (found position + 1). In other words, combine the regex with a little code.

[EDIT] Great, I am downvoted when I basically said what Jan showed...
[EDIT 2] To be clear: Jan's answer is better. Not more precise, but certainly more detailed, and it deserves to be chosen. I just don't understand why mine is downvoted, since I still see nothing incorrect in it. Not a big deal, just annoying.

answered Nov 13 '22 21:11 by PhiLho