
Can you explain why \G in my Perl regex pattern behaves this way?

Tags: regex, perl

$dna = "ATCGTTGAATGCAAATGACATGAC";
while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?        
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}

The output is:

Got a TGA stop codon at position 18    
Got a TGA stop codon at position 23

Why is the first reported position 18 and not 8, and why is the next one 23? I'm confused about how the pattern matches. What exactly happens during the match?

But the correct code is:

while ($dna =~ /\G(\w\w\w)*?TGA/g) {        
  print "Got a TGA stop codon at position ", pos $dna, "\n";
}

This prints:

Got a TGA stop codon at position 18

How?

user2677944 asked Aug 19 '13 09:08



2 Answers

As @Tomalak said, you don't need *?; it is the source of the confusion here. This is what happens in your first piece of code:

The engine sees that (\w\w\w)*? is lazy (it may match zero times and prefers to), so it skips the group and tries to match TGA at the start of the string. That fails, so the engine backtracks and lets the group match three consecutive word characters, ATC. It tries TGA again, fails again, and lets the group take another three \w, so it has now read ATCGTT.

TGA fails once more, so the engine backtracks and reads \w\w\w again, which brings it to ATCGTTGAA. The first TGA straddles the last two groups of three (GTT and GAA), so the engine has already stepped over it. That is why it never finds the first TGA and never reports its position.

The engine continues in this manner until it finds the TGA right after AAA (keep stepping through three characters at a time and you will see it). Only then does it run the body of the loop, which prints 18.

Because you used the /g modifier, the next match attempt starts where the first one ended (offset 18), and it fails. The engine then retries from one character further along, and so on, until it matches the last TGA and prints 23.
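One way to watch this bump-along is to record where each match starts as well as where it ends. The snippet below is only an illustrative sketch: the @trace array is a name introduced here, while $-[0] is Perl's built-in offset of the start of the last successful match.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";
my @trace;    # one [start, end] pair per successful match (name is illustrative)

while ($dna =~ /(\w\w\w)*?TGA/g) {
    # $-[0] is where this match began; pos($dna) is where it ended.
    push @trace, [ $-[0], pos($dna) ];
    printf "match started at %d, ended at %d\n", $-[0], pos($dna);
}
```

On this input the first match should start at 0 and end at 18, and the second should start at 20 and end at 23. The second start is not a multiple of three, which shows that the reading frame was lost when the engine moved its starting point forward.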

So why does the second version match only once, at position 18? What is the effect of the \G assertion?

Everything works the same up to the first match, the one after AAA. When the next attempt starts, \G has to match first. It asserts that the current position is where the previous match ended (right after AAATGA), so it succeeds there, but the rest of the pattern then fails to match from that point. This time, when the engine tries to start one, two, three or more characters further along, it must again match \G first, and \G can only match at the end of the previous match. Every later attempt therefore fails, and only the single match at position 18 is reported.
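The anchoring can be checked with a short sketch. The first loop repeats the \G version from the question. The second part sets pos by hand to 20, a value chosen purely for illustration, to show that \G simply pins the attempt to the current pos. The names @ends and $manual_end are made up for this example.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";
my @ends;    # end offsets found by the \G-anchored loop

while ($dna =~ /\G(\w\w\w)*?TGA/g) {
    push @ends, pos($dna);
}
print "anchored loop found: @ends\n";

# Assumption for illustration only: pretend a previous match ended at offset 20.
pos($dna) = 20;
my $manual_end;
if ($dna =~ /\G(\w\w\w)*?TGA/g) {
    # \G matched at offset 20, so TGA can be found immediately after it.
    $manual_end = pos($dna);
}
print "with pos set to 20: ", defined $manual_end ? $manual_end : "no match", "\n";
```

The loop should collect only 18, because the attempt anchored at offset 18 fails and the engine may not slide forward. With pos moved to 20 by hand, the same pattern should match and end at 23.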

Simply remove *?, as @Tomalak said.

Ibrahim Najjar answered Nov 15 '22 05:11


You don't need to use *? at all.

$dna = "ATCGTTGAATGCAAATGACATGAC";
while ($dna =~ /(?:\w\w\w)TGA/g) {
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}

prints

Got a TGA stop codon at position 8
Got a TGA stop codon at position 18

Note that *? makes the preceding atom optional, but you actually want it to be required.

  • The non-capturing group (?: ...) is not really necessary. You could use a normal group.
  • Another variant would be /[TGAC]{3}TGA/g.
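
A quick way to check that these variants agree on this input is to run each one and collect the end positions. This is only a sketch; the labels and the @variants and %ends names are invented for the example.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";

# Each entry pairs an illustrative label with one of the patterns discussed.
my @variants = (
    [ 'non-capturing group', qr/(?:\w\w\w)TGA/ ],
    [ 'capturing group',     qr/(\w\w\w)TGA/ ],
    [ 'character class',     qr/[TGAC]{3}TGA/ ],
);

my %ends;    # label => array of end offsets
for my $variant (@variants) {
    my ($label, $re) = @$variant;
    $ends{$label} = [];
    while ($dna =~ /$re/g) {
        push @{ $ends{$label} }, pos($dna);
    }
    print "$label: @{ $ends{$label} }\n";
}
```

All three should report 8 and 18 for this string. None of them reports 23, because the three characters in front of that TGA overlap the match that ended at 18.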
Tomalak answered Nov 15 '22 05:11