I would like to extract all the plain text from an HTML/XHTML document, analyse or amend it, and then write it back if needed. Can I do this using HTML::Parser, or should it be XML::Parser?
Does anyone know of any good demonstrations?
HTML::Parser takes a token-and-callback approach. I find it very convenient when the data you wish to extract or change has to satisfy particularly complex conditions on the context in which it occurs.
Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately built on HTML::Parser) lets you find nodes with XPath and returns them as HTML::Element objects. The documentation is a little sparse (or rather, spread over a couple of modules), but it is still the quick way to mine HTML.
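As a rough illustration of the tree-based route, here is a minimal sketch that finds nodes with XPath and edits only their plain-text children. The helper name fix_text_at, the sample markup and the XPath expression are made up for the example; findnodes, content_refs_list and as_text are the calls doing the work.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: rewrite PERL -> Perl in the text directly inside every
# node matched by $xpath, and return the resulting text of those nodes.
sub fix_text_at {
    my ($html, $xpath) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @texts;
    for my $node ($tree->findnodes($xpath)) {      # HTML::Element objects
        # A content list mixes child elements (references) and plain strings.
        for my $item_ref ($node->content_refs_list) {
            next if ref $$item_ref;                # leave child elements alone
            $$item_ref =~ s/\bPERL\b/Perl/g;
        }
        push @texts, $node->as_text;
    }

    $tree->delete;                                 # free the tree explicitly
    return @texts;
}

# Sample input invented for this sketch.
print "$_\n" for fix_text_at(
    '<h3><a href="/q/1" title="PERL: faq">comparing PERL md5() and PHP md5()</a></h3>',
    '//h3/a',
);
```

Because attributes are not part of the content list, the title attribute is left alone. Calling as_HTML on the tree instead of as_text would give you the modified markup back.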
If you are dealing with pure XML, XML::Twig is an outstanding parser: it manages memory very well and lets you combine the tree and stream approaches. The documentation is also very good.
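For the XML case, a minimal XML::Twig sketch might look like the following. The helper name fix_paragraphs and the <doc>/<p> element names are assumptions chosen for the example, not anything from a real document.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: fix the text of every <p> element and return the
# serialized document.
sub fix_paragraphs {
    my ($xml) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once each <p> element has been completely parsed.
            p => sub {
                my ($t, $elt) = @_;
                (my $text = $elt->text) =~ s/\bPERL\b/Perl/g;
                # Caveat: set_text replaces the whole content of the element,
                # so any child markup inside <p> is flattened to plain text.
                $elt->set_text($text);
            },
        },
    );
    $twig->parse($xml);
    return $twig->sprint;
}

# Sample input invented for this sketch.
print fix_paragraphs('<doc><p>Learning PERL</p><p>PERL and XML</p></doc>'), "\n";
```

For a document too large to hold in memory, the handler could call $t->flush (print what has been parsed so far and free it) or $t->purge (free it without printing) instead of keeping the whole tree until the end. That is the tree/stream combination mentioned above.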
Say you want to replace all instances of PERL with Perl on someone's Stack Overflow user page. You could do so with:
    #! /usr/bin/perl

    use warnings;
    use strict;

    use HTML::Parser;
    use LWP::Simple;

    my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
    die "$0: get failed" unless defined $html;

    # Called for every start and end tag: $skipped is everything seen since
    # the previously reported event, $markup is the tag itself.
    sub replace_text {
      my($skipped,$markup) = @_;

      $skipped =~ s/\bPERL\b/Perl/g;
      print $skipped, $markup;
    }

    my $p = HTML::Parser->new(
      api_version     => 3,
      marked_sections => 1,
      case_sensitive  => 1,
      unbroken_text   => 1,
      xml_mode        => 1,
      start_h => [ \&replace_text => "skipped_text, text" ],
      end_h   => [ \&replace_text => "skipped_text, text" ],
      # print whatever follows the last tag
      end_document_h => [ sub { replace_text($_[0], "") } => "skipped_text" ],
    );

    # your page may use a different encoding
    binmode STDOUT, ":utf8" or die "$0: binmode: $!";
    $p->parse($html);
    $p->eof;  # signal end of input so the end_document handler runs
The output is what we expect:
    $ wget -O phil-jackson.html http://stackoverflow.com/users/201469
    $ ./replace-text >out.html
    $ diff -ub phil-jackson.html out.html
    --- phil-jackson.html
    +++ out.html
    @@ -327,7 +327,7 @@
     PERL:
    -#$linkTrue = … ">comparing PERL md5() and PHP md5()</a></h3>
    +#$linkTrue = … ">comparing Perl md5() and PHP md5()</a></h3>
     <div class="tags t-php t-perl t-md5">
     <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a>
The "PERL:" sore thumb is part of an element attribute, not a text section.
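To see in isolation why attribute values survive, here is a small alternative sketch (not the script above) that registers a text handler for the substitution and passes every other event through a default handler. The helper name replace_in_text_only and the sample markup are invented for the illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: apply the substitution to text events only and copy
# every other event (tags, comments, declarations) through unchanged.
sub replace_in_text_only {
    my ($html) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,   # deliver each text run as a single event
        text_h => [
            sub { my ($text) = @_; $text =~ s/\bPERL\b/Perl/g; $out .= $text },
            "text",
        ],
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, "text" ],
    );
    $p->parse($html);
    $p->eof;                  # flush any buffered text
    return $out;
}

# Sample input invented for this sketch.
print replace_in_text_only('<a href="/q/1" title="PERL: faq">PERL md5()</a>'), "\n";
```

The start tag, including its title attribute, arrives as a start event rather than a text event, so the substitution never sees it. Note that in HTML mode the contents of script and style elements are also reported as text, so a real script may want to check for that before rewriting.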