I wrote a small Perl script with regular expressions to get HTML components of a website.
I know it's not a good way of doing this kind of job, but I was trying to test out my regex skills.
When run with either one of the two regex patterns in the while loop, it runs perfectly and displays the correct output. But when I check both patterns in the same while loop, the second pattern matches every time and the loop runs forever.
My script:
#!/usr/bin/perl -w
use strict;
while (<STDIN>) {
    while ( (m/<span class=\"itempp\">([^<]+)+?<\/span>/g) ||
            (m/<font size=\"-1\">([^<]+)+?<\/font>/g) ) {
        print "$1\n";
    }
}
I am testing the above script with a sample input:
<a href="http://linkTest">Link title</a>
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>
Desired output:
$150
(Location)
Thank you! Any help would be highly appreciated!
Whenever a global (/g) regex match in scalar context fails, it resets the position (pos) from which the next global match will start searching. So when the first of your two patterns fails, it forces the second to search from the beginning of the string again, and the second pattern finds the same match on every iteration.
This behaviour can be disabled by adding the /c modifier, which leaves the position unchanged if a regex fails to match.
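The position reset can be observed directly with pos(). The following is only a small sketch, using the sample font line from the question and variable names invented for illustration; it runs one successful and one failing match, first with /g and then with /gc:

```perl
use strict;
use warnings;

# Sample line taken from the question's input (single quotes: no interpolation).
my $line = '<font size="-1"> (Location)</font>';

# Plain /g: a successful scalar-context match records where it stopped...
my $matched = $line =~ m{<font size="-1">([^<]+)</font>}g;
my $pos_after_match = pos($line);

# ...but a failed /g match clears that position (pos becomes undef),
# so the next /g match would start from the beginning of the string.
my $failed = $line =~ m{<span class="itempp">}g;
my $pos_after_fail = pos($line);

# With /gc, a failed match leaves the recorded position untouched.
$matched = $line =~ m{<font size="-1">([^<]+)</font>}gc;
my $pos_c_match = pos($line);
$failed = $line =~ m{<span class="itempp">}gc;
my $pos_c_fail = pos($line);

printf "after /g match:   %s\n", $pos_after_match // 'undef';
printf "after /g failure: %s\n", $pos_after_fail  // 'undef';
printf "after /gc match:   %s\n", $pos_c_match // 'undef';
printf "after /gc failure: %s\n", $pos_c_fail  // 'undef';
```

The second line of output should show undef while the fourth keeps the same offset as the third, which is why the loop in the question never terminates without /c.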
In addition, you can improve your patterns by removing the escape characters (" doesn't need escaping, and / needn't be escaped if you choose a different delimiter) and the superfluous +? after the captures.
Also, "use warnings" is much better than -w on the command line, because it is lexically scoped rather than switching warnings on globally.
Here is a working version of your code.
use strict;
use warnings;
while (<STDIN>) {
    while ( m|<span class="itempp">([^<]+)</span>|gc
         or m|<font size="-1">([^<]+)</font>|gc ) {
        print "$1\n";
    }
}
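One thing to watch with the two-pattern loop is that both matches share a single position, so a font element that appears before a span on the same line could be skipped once the span match has moved the position past it. A possible alternative, sketched here with a hypothetical helper named extract_items that is not part of the original code, is to fold both tags into one pattern so that a single /g match walks the line from left to right:

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): returns the text of
# every matching span or font element on a line, in order of appearance.
sub extract_items {
    my ($line) = @_;
    my @found;
    # One pattern with alternation means only one /g match advances pos(),
    # so no failed match can reset it while other matches remain.
    while ( $line =~ m{<(?:span class="itempp"|font size="-1")>([^<]+)</(?:span|font)>}g ) {
        push @found, $1;
    }
    return @found;
}

# Both elements on one line, font first (an assumed input, not from the post).
my $sample = '<font size="-1"> (Location)</font><span class="itempp">$150</span>';
print "$_\n" for extract_items($sample);
```

This sketch does not check that the closing tag matches the opening one, which is acceptable for an exercise like this but is one more reason to prefer an HTML parser for real pages.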
while (<DATA>) {
    if (m{<(?:span class="itempp"|font size="-1")>\s*([^<]+)}i) {
        print "$1\n";
    }
}
__DATA__
<a href="http://linkTest">Link title</a>
<span class="itempp">$150</span>
<font size="-1"> (Location)</font>