
Infinite loop using a pair of Perl regex matches

I wrote a small Perl script with regular expressions to get HTML components of a website.

I know it's not a good way of doing this kind of job, but I wanted to test my regex skills.

With either one of the two regex patterns alone in the while loop, the script runs perfectly and displays the correct output. But when I check both patterns in the while loop, the second pattern matches every time and the loop runs forever.

My script:

#!/usr/bin/perl -w
use strict;

while (<STDIN>) {

    while ( (m/<span class=\"itempp\">([^<]+)+?<\/span>/g) ||
            (m/<font size=\"-1\">([^<]+)+?<\/font>/g) ) {
        print "$1\n";
    }
}

I am testing the above script with a sample input:

<a href="http://linkTest">Link title</a>
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>

Desired output:

$150
(Location)

Thank you! Any help would be highly appreciated!

asked Jul 29 '12 08:07 by javaCity



2 Answers

Whenever a global (/g) regex fails to match, it resets the position at which the next global match on that string will start searching. So when the first of your two patterns fails, it forces the second to search from the beginning of the string again, where it finds the same match every time.
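The reset can be observed directly with pos(). This is only a minimal sketch: the sub name pos_after_failed_match and the sample string 'abc' are made up for illustration, and the // operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: reports pos() after a successful /g match and then
# after a failed /g match on the same string (undef is shown as 'undef').
sub pos_after_failed_match {
    my ($text) = @_;
    my @trace;

    my $hit = $text =~ /b/g;            # succeeds; pos is now just past the "b"
    push @trace, pos($text) // 'undef';

    my $miss = $text =~ /z/g;           # fails; without /c the position is reset
    push @trace, pos($text) // 'undef';

    return @trace;
}

print join(', ', pos_after_failed_match('abc')), "\n";   # prints "2, undef"
```

In the question's loop the failing match is the first pattern, so the second pattern always starts over from the beginning of the line.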

This behaviour can be disabled by adding the /c modifier, which leaves the position unchanged if a regex fails to match.
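To see the difference /c makes, here is a rough sketch that runs the two patterns against a single line with and without the modifier. The sub collect_matches and its $limit cap are illustrative assumptions, not part of the original script; the cap exists only so that the variant without /c cannot loop forever.

```perl
use strict;
use warnings;

# Hypothetical helper: collects captures from one line using the two patterns.
# $keep_pos selects the /gc variant; $limit caps the number of iterations.
sub collect_matches {
    my ($line, $keep_pos, $limit) = @_;
    my @found;
    while (@found < $limit) {
        my $matched = $keep_pos
            ? (   $line =~ m{<span class="itempp">([^<]+)</span>}gc
               || $line =~ m{<font size="-1">([^<]+)</font>}gc )
            : (   $line =~ m{<span class="itempp">([^<]+)</span>}g
               || $line =~ m{<font size="-1">([^<]+)</font>}g );
        last unless $matched;
        push @found, $1;
    }
    return @found;
}

my $sample = '<font size="-1"> (Location)</font>';
print scalar(collect_matches($sample, 1, 5)), "\n";   # prints 1: /gc stops after the real match
print scalar(collect_matches($sample, 0, 5)), "\n";   # prints 5: /g alone re-matches until the cap
```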

In addition, you can improve your patterns by removing the escape characters (" doesn't need escaping and / needn't be escaped if you choose a different delimiter) and the superfluous +? after the captures.

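Also, use warnings is much better than -w on the command line, because it is lexically scoped rather than switching warnings on globally for every module you load.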

Here is a working version of your code.

use strict;
use warnings;

while (<STDIN>) {

    while( m|<span class="itempp">([^<]+)</span>|gc
            or m|<font size="-1">([^<]+)</font>|gc ) {
        print "$1\n";
    }
}
answered Sep 21 '22 10:09 by Borodin



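# Single pass per line: match either opening tag, skip leading whitespace,
# and capture the text up to the next "<". Only the first match on each
# line is printed, because there is no /g here.
while (<DATA>) {
    if (m{<(?:span class="itempp"|font size="-1")>\s*([^<]+)}i) {
        print "$1\n";
    }
}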

__DATA__
<a href="http://linkTest">Link title</a>
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>
answered Sep 19 '22 10:09 by cdtits
