$dna = "ATCGTTGAATGCAAATGACATGAC";
while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
The output is:
Got a TGA stop codon at position 18
Got a TGA stop codon at position 23
Why is the first position 18 and not 8? And why is the next one 23? I'm confused about how the pattern matches. What exactly happens during the match?
But the right code is:
while ($dna =~ /\G(\w\w\w)*?TGA/g) {
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
This prints:
Got a TGA stop codon at position 18
How?
As @Tomalak said, you don't need *? because it is the cause of the confusion here. This is what happens in your first piece of code:
The engine sees that (\w\w\w)*? is reluctant (lazy, so zero repetitions are tried first), so it skips the group and tries to match TGA at the start of the string, with no luck. It then backtracks and lets the group match three consecutive word characters, ATC, and tries TGA again, still with no luck. So it reads another three \w characters and has consumed ATCGTT so far.
It tries TGA again and fails, backtracks, and reads \w\w\w once more, so it has now consumed ATCGTTGAA. Now it looks for TGA, but the first TGA has already been swallowed by the last two triples (it sits at offsets 5 to 7, straddling GTT and GAA). This is why the engine never matches the first TGA and never reports its position.
The engine continues in this manner until it finds the TGA right after AAA (if you keep stepping through as I did, you will see how this happens), and then it executes the body of the loop, printing 18.
Since you have used the /g modifier, the next match attempt starts where the first one ended (offset 18), and it fails. The engine then retries one character further along, and so on, until it matches the last TGA and prints 23.
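To make the walkthrough above concrete, here is a small sketch that records where each successful match started and ended. It relies on the built-in @- array (match start offsets) and $& (the matched text); the @trace variable is only a name I picked for illustration.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";
my @trace;    # one [start, end, matched text] triple per successful match

while ($dna =~ /(\w\w\w)*?TGA/g) {
    # $-[0] is the offset where the match began, pos() is where it ended
    push @trace, [ $-[0], pos($dna), $& ];
}

printf "start %2d, end %2d, matched %s\n", @$_ for @trace;
# start  0, end 18, matched ATCGTTGAATGCAAATGA
# start 20, end 23, matched TGA
```

The first match starts at offset 0 and swallows five whole triples before its TGA, which is why the TGA ending at offset 8 is never reported. The second match starts at offset 20 with zero triples.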
So why does the second version only report one position, 18? What is the effect of \G (which is an assertion, not a modifier)?
Everything works the same way until the first match, which again ends after AAATGA. When the next attempt starts, \G asserts that the match begins where the last one ended, right after AAATGA, and that succeeds, but the rest of the pattern then fails to match the remaining string. This time, when the engine moves the starting point forward by one character, or two, or three, it must match \G first, and \G can only match at the position where the previous match ended. Every later attempt therefore fails, and only a single position, 18, is reported.
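A rough way to see the effect of \G is to collect the positions it reports and then split the string into codons from offset 0, which shows the reading frame the anchored pattern is stuck in. This is only a sketch; @anchored and @codons are names I made up for the example.

```perl
use strict;
use warnings;

my $dna = "ATCGTTGAATGCAAATGACATGAC";

# With \G every attempt must begin where the previous match ended
# (offset 0 for the first attempt), so the engine cannot slide forward.
my @anchored;
while ($dna =~ /\G(\w\w\w)*?TGA/g) {
    push @anchored, pos($dna);
}

# Start again from offset 0 and split the string into codons.
pos($dna) = 0;
my @codons = $dna =~ /\G(\w\w\w)/g;

print "@anchored\n";    # 18
print "@codons\n";      # ATC GTT GAA TGC AAA TGA CAT GAC
```

The only TGA that lines up with a codon boundary is the sixth codon, which ends at offset 18. The other two occurrences of TGA straddle codon boundaries, so the anchored pattern cannot reach them.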
The simple fix is to remove *?, as @Tomalak said.
You don't need to use *? at all.
$dna = "ATCGTTGAATGCAAATGACATGAC";
while ($dna =~ /(?:\w\w\w)TGA/g) {
    print "Got a TGA stop codon at position ", pos $dna, "\n";
}
prints
Got a TGA stop codon at position 8
Got a TGA stop codon at position 18
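Note that *? allows the preceding atom to match zero times, which makes it optional, but you actually want it to be required.
You could also use a stricter pattern that only accepts nucleotide letters:
/[TGAC]{3}TGA/g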
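For comparison, here is a minimal sketch that runs both patterns from this answer through the same loop. The end_positions helper is hypothetical; I wrote it just for this example, and it does nothing more than collect pos() after each /g match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the pos() reported after each /g match.
sub end_positions {
    my ($dna, $regex) = @_;
    my @ends;
    while ($dna =~ /$regex/g) {
        push @ends, pos($dna);
    }
    return @ends;
}

my $dna = "ATCGTTGAATGCAAATGACATGAC";

my @any_word   = end_positions($dna, qr/(?:\w\w\w)TGA/);
my @bases_only = end_positions($dna, qr/[TGAC]{3}TGA/);

print "@any_word\n";      # 8 18
print "@bases_only\n";    # 8 18
```

Both patterns report 8 and 18 for this string. Neither reports 23, because after the match ending at offset 18 only two characters (CA) are left in front of the last TGA, and both patterns require exactly three.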