
In Perl, how do you find the position of a match in a string, if forced to use a foreach loop?

Tags:

foreach

perl

I have to find all the positions of matching strings within a larger string using a while loop, and as a second method using a foreach loop. I have figured out the while loop method, but I am stuck on a foreach method. Here is the 'while' method:

....

my $sequence = 
   'AACAAATTGAAACAATAAACAGAAACAAAAATGGATGCGATCAAGAAAAAGATGC'.
   'AGGCGATGAAAATCGAGAAGGATAACGCTCTCGATCGAGCCGATGCCGCGGAAGA'.
   'AAAAGTACGTCAAATGACGGAAAAGTTGGAACGAATCGAGGAAGAACTACGTGAT'.
   'ACCCAGAAAAAGATGATGCNAACTGAAAATGATTTAGATAAAGCACAGGAAGATT'.
   'TATCTGTTGCAAATACCAACTTGGAAGATAAGGAAAAGAAAGTTCAAGAGGCGGA'.
   'GGCTGAGGTAGCANCCCTGAATCGTCGTATGACACTTCTGGAAGAGGAATTGGAA'.
   'CGAGCTGAGGAACGTTTGAAGATTGCAACGGATAAATTGGAAGAAGCAACACATA'.
   'CAGCTGATGAATCTGAACGTGTTCGCNAGGTTATGGAAA';

my $string = <STDIN>;
chomp $string;

while ($sequence =~ /$string/gi )
{
 printf "Sequence found at position: %d\n", pos($sequence)- length($string);
}

Here is my foreach method:

foreach ($sequence =~ /$string/gi)
{
 printf "Sequence found at position: %d\n", pos($sequence) - length($string);
}

Could someone please give me a clue on why it doesn't work the same way? Thanks!

My Output if I input "aaca":

Part 1 using a while loop
Sequence found at position: 0
Sequence found at position: 10
Sequence found at position: 17
Sequence found at position: 23
Sequence found at position: 377

Part 2 using a foreach loop
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
asked Jan 31 '11 at 21:01 by user83598


1 Answer

Your problem here is context. In the while loop, the condition is in scalar context. In scalar context, the match operator with the /g modifier matches sequentially along the string, finding one match each time it is evaluated. Thus checking pos within the loop does what you want.

In the foreach loop, the match is evaluated in list context. In list context, the match operator with the /g modifier returns a list of all matches (and it calculates all of them before the loop body is ever entered). foreach then loads the matches one by one into $_ for you, but you never use that variable. pos in the body of the loop is not useful, because it reflects the state after matching has finished: the final, failed attempt resets it to undef, which counts as 0 in the subtraction, so every line prints 0 - length($string), or -4 for "aaca".

The takeaway here is that if you want pos to work and you are using the /g modifier, you should use the while loop, which imposes scalar context and makes the regex step through the matches in the string one at a time.
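
To see the context difference concretely, here is a small sketch. The short sample string and the sub names positions_scalar and pos_after_list_match are made up for illustration and are not part of the question's code.

```perl
use strict;
use warnings;

# Scalar context: each evaluation of m//g finds the next match and
# leaves pos() just past it, so pos() is meaningful inside the loop.
sub positions_scalar {
    my ($haystack, $needle) = @_;
    my @found;
    while ($haystack =~ /\Q$needle\E/gi) {
        push @found, pos($haystack) - length $needle;
    }
    return @found;
}

# List context: m//g returns every match at once. The final, failing
# attempt resets pos() to undef before any loop body could run.
sub pos_after_list_match {
    my ($haystack, $needle) = @_;
    my @matches = $haystack =~ /\Q$needle\E/gi;
    return (scalar(@matches), pos($haystack));
}

my @where = positions_scalar('AACAAATTGAAACAAT', 'aaca');
print "scalar context positions: @where\n";

my ($count, $pos) = pos_after_list_match('AACAAATTGAAACAAT', 'aaca');
printf "list context: %d matches, pos is %s\n",
    $count, defined $pos ? $pos : 'undef';
```

The sample string is just the first 16 characters of the question's sequence, so the scalar-context positions should agree with the first two lines of the while-loop output.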

Sinan inspired me to write a few foreach examples:

  • This one is fairly succinct using split in separator retention mode:

    my $pos = 0;
    foreach (split /(\Q$string\E)/i => $sequence) {
        print "Sequence found at position: $pos\n" if lc eq lc $string;
        $pos += length;
    }
    
  • A regex equivalent of the split solution:

    my $pos = 0;
    foreach ($sequence =~ /(\Q$string\E|(?:(?!\Q$string\E).)+)/gi) {
        print "Sequence found at position: $pos\n" if lc eq lc $string;
        $pos += length;
    }
    
  • But this is clearly the best solution for your problem:

    {package Dumb::Homework;
        sub TIEARRAY {
            bless {
                haystack => $_[1],
                needle   => $_[2],
                size     => 2**31-1,
                pos      => [],
            }
        }
        sub FETCH {
            my ($self, $index) = @_;
            my ($pos, $needle) = @$self{qw(pos needle)};
    
            return $$pos[$index] if $index < @$pos;
    
            while ($index + 1 >= @$pos) {
                unless ($$self{haystack} =~ /\Q$needle/gi) {
                    $$self{size} = @$pos;
                    last
                }
                push @$pos, pos ($$self{haystack}) - length $needle;
            }
            $$pos[$index]
        }
        sub FETCHSIZE {$_[0]{size}}
    }
    
    tie my @pos, 'Dumb::Homework' => $sequence, $string;
    
    print "Sequence found at position: $_\n" foreach @pos; # look how clean it is
    

    The reason it's the best is that the other two solutions have to process the entire global match before you ever see a result. For large inputs (like DNA), that could be a problem. The Dumb::Homework package implements an array that lazily finds the next position each time the foreach iterator asks for it. It even stores the positions, so you can get to them again without reprocessing. (In truth, it looks one match past the requested one so that the foreach can end properly, but that is still much better than processing the whole list.)

  • Actually, the best solution is still to not use foreach as it is not the correct tool for the job.
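
For completeness, here is a minimal sketch of that non-foreach approach. Instead of computing pos minus the length, it reads the start of each match from the built-in @- array ($-[0] is the offset where the last successful match began), and it quotes the input with \Q...\E so that any regex metacharacters in it are taken literally. The sub name match_positions and the short sample string are illustrative only.

```perl
use strict;
use warnings;

# Return the start offset of every case-insensitive, non-overlapping
# occurrence of $needle in $haystack.
sub match_positions {
    my ($haystack, $needle) = @_;
    return if $needle eq '';    # nothing to search for
    my @starts;
    while ($haystack =~ /\Q$needle\E/gi) {
        push @starts, $-[0];    # $-[0]: offset where the last match began
    }
    return @starts;
}

print map { "Sequence found at position: $_\n" }
    match_positions('AACAAATTGAAACAAT', 'aaca');
```

Because the loop is still a while loop in scalar context, it works through the string one match at a time, just like the original while version.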

answered Nov 28 '22 at 23:11 by Eric Strom