
Should I use HTML::Parser or XML::Parser to extract and replace text? [closed]

I want to extract all the plain text from an HTML/XHTML document, analyse or amend it, and then write it back if needed. Can I do this with HTML::Parser, or should I use XML::Parser?

Are there any good demonstrations that anyone knows of?

Phil Jackson asked Dec 18 '22 03:12


2 Answers

HTML::Parser's approach is based on tokens and callbacks. I find it very convenient when the data you wish to extract or change occurs in a context governed by particularly complex conditions.
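As a minimal sketch of the callback style, the following collects text chunks while tracking context in the start and end handlers, so that anything inside script or style elements is ignored. The function name visible_text and the choice of elements to skip are illustrative assumptions, not part of the question.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the non-blank text chunks of $html that do
# not sit inside a <script> or <style> element.
sub visible_text {
    my ($html) = @_;
    my @chunks;
    my $skip_depth = 0;    # greater than 0 while inside an ignored element

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run as one chunk
        start_h => [ sub {
            my ($tag) = @_;
            $skip_depth++ if $tag =~ /^(?:script|style)$/;
        }, "tagname" ],
        end_h => [ sub {
            my ($tag) = @_;
            $skip_depth-- if $skip_depth && $tag =~ /^(?:script|style)$/;
        }, "tagname" ],
        text_h => [ sub {
            my ($text) = @_;
            push @chunks, $text if !$skip_depth && $text =~ /\S/;
        }, "dtext" ],
    );
    $p->parse($html);
    $p->eof;
    return @chunks;
}

print "$_\n" for visible_text('<p>Hello <b>world</b></p><script>var x = 1;</script>');
```

The context here is a single counter, but the same handlers could keep a stack of open tags when the conditions get more involved.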

Otherwise I prefer a tree-based approach. HTML::TreeBuilder::XPath (ultimately based on HTML::Parser) lets you find nodes with XPath and returns HTML::Element objects. The documentation is a little sparse (or rather, spread over a couple of modules), but it is still the quick way to dig into HTML.
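A rough sketch of the tree-based route, assuming you want to fix the spelling of PERL only in the direct text children of the nodes an XPath expression selects. The function name fix_spelling and the sample markup are made up for illustration; content_refs_list comes from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: replace PERL with Perl in the text children of
# every element matched by $xpath, and return the serialized tree.
sub fix_spelling {
    my ($html, $xpath) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($html);

    for my $node ($tree->findnodes($xpath)) {
        # Each item is a reference to a child: plain strings are text,
        # objects are nested elements.
        for my $ref ($node->content_refs_list) {
            next if ref $$ref;              # skip child elements
            $$ref =~ s/\bPERL\b/Perl/g;     # edit the text in place
        }
    }

    my $out = $tree->as_HTML;
    $tree->delete;                          # release the tree's memory
    return $out;
}

print fix_spelling('<p title="PERL">I like PERL</p>', '//p'), "\n";
```

Because only text children are edited, attribute values such as the title above are left untouched. Note that as_HTML re-serializes the whole document, so the output will not be byte-for-byte identical to the input outside the edited text.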

If you are dealing with pure XML, XML::Twig is an outstanding parser: it has very good memory management and lets you combine the tree and stream approaches. The documentation is also very good.
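To illustrate how XML::Twig mixes the two styles, here is a small sketch, assuming a document made of p elements: the handler fires as each p closes (the stream side) and receives that element as a small tree to edit (the tree side). The function name fix_xml and the element names are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: replace PERL with Perl in the text of every <p>.
sub fix_xml {
    my ($xml) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once for each <p>, as soon as it is fully parsed.
            p => sub {
                my ($t, $elt) = @_;
                for my $text ($elt->descendants('#PCDATA')) {
                    (my $fixed = $text->pcdata) =~ s/\bPERL\b/Perl/g;
                    $text->set_pcdata($fixed);
                }
                # For large inputs, calling $t->flush here would print
                # what has been parsed so far and free its memory.
            },
        },
    );
    $twig->parse($xml);
    return $twig->sprint;
}

print fix_xml('<doc><p lang="PERL">PERL rocks</p></doc>'), "\n";
```

This version keeps the whole document in memory so that it can return a string; the flush variant mentioned in the comment trades that for constant memory use.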

i-blis answered Jan 11 '23 23:01


Say you want to replace every instance of PERL with Perl in someone's Stack Overflow user page. You could do so with

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser;
use LWP::Simple;

my $html = get "http://stackoverflow.com/users/201469/phil-jackson";
die "$0: get failed" unless defined $html;

sub replace_text {
  my($skipped,$markup) = @_;
  $skipped =~ s/\bPERL\b/Perl/g;
  print $skipped, $markup;
}

my $p = HTML::Parser->new(
  api_version => 3,
  marked_sections => 1,
  case_sensitive => 1,
  unbroken_text => 1,
  xml_mode => 1,
  start_h => [ \&replace_text => "skipped_text, text" ],
  end_h => [ \&replace_text => "skipped_text, text" ],
);

# your page may use a different encoding
binmode STDOUT, ":utf8" or die "$0: binmode: $!";
$p->parse($html);
$p->eof;  # tell the parser the document is complete

The output is what we expect:

$ wget -O phil-jackson.html http://stackoverflow.com/users/201469
$ ./replace-text >out.html
$ diff -ub phil-jackson.html out.html
--- phil-jackson.html
+++ out.html
@@ -327,7 +327,7 @@

 PERL:  

-#$linkTrue =  &hellip; ">comparing PERL md5() and PHP md5()</a></h3>
+#$linkTrue =  &hellip; ">comparing Perl md5() and PHP md5()</a></h3>

         <div class="tags t-php t-perl t-md5">
             <a href="/questions/tagged/php" class="post-tag" title="show questions tagged 'php'" rel="tag">php</a> <a href="/questions/tagged/perl" class="post-tag" title="show questions tagged 'perl'" rel="tag">perl</a> <a href="/questions/tagged/md5" class="post-tag" title="show questions tagged 'md5'" rel="tag">md5</a> 

The remaining "PERL:" sticks out like a sore thumb, but it is part of an element attribute, not a text section, so the substitution leaves it alone.
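If you want to check where the leftover matches live, a sketch along these lines reports text hits and attribute hits separately, using the attr argspec of the start handler. The function name locate and the tag@attribute labels are inventions for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: report where $re matches, split into text
# sections and attribute values (labelled as tag@attribute).
sub locate {
    my ($html, $re) = @_;
    my (@in_text, @in_attr);

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        text_h => [ sub {
            my ($text) = @_;
            push @in_text, $text if $text =~ $re;
        }, "dtext" ],
        start_h => [ sub {
            my ($tag, $attr) = @_;    # $attr is a hash of attribute values
            for my $name (sort keys %$attr) {
                push @in_attr, "$tag\@$name" if $attr->{$name} =~ $re;
            }
        }, "tagname, attr" ],
    );
    $p->parse($html);
    $p->eof;
    return (\@in_text, \@in_attr);
}

my ($text, $attr) = locate(
    '<a title="PERL: md5" href="/q">comparing PERL md5()</a>',
    qr/\bPERL\b/,
);
print "text: $_\n" for @$text;
print "attr: $_\n" for @$attr;
```

Rewriting the attribute values themselves would mean re-serializing each start tag, which this sketch deliberately leaves out.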

Greg Bacon answered Jan 11 '23 22:01