
How do I most reliably preserve HTML Entities when processing HTML documents with Mojo::DOM?

I'm using Mojo::DOM to identify and print out phrases (meaning strings of text between selected HTML tags) in hundreds of HTML documents that I'm extracting from existing content in the Movable Type content management system.

I'm writing those phrases out to a file, so they can be translated into other languages as follows:

    $dom = Mojo::DOM->new(Mojo::Util::decode('UTF-8', $page->text));

    ##########
    #
    # Break down the Body into phrases. This is done by listing the tags and tag combinations that
    # surround each block of text that we're looking to capture.
    #
    ##########

    print FILE "\n\t### Body\n\n";

    for my $phrase ( $dom->find('h1, h2, h2 b, h3, p, p strong, span, a, caption, th, li, li a')->map('text')->each ) {
        print_phrase($phrase); # utility function to write out the phrase to a file
    }

When Mojo::DOM encountered embedded HTML entities (such as &trade; and &nbsp;), it converted them into the characters they represent rather than passing them along as written. I wanted the entities to be passed through as written.

I recognized that I could use Mojo::Util::decode to pass these HTML entities through to the file I'm writing. The problem is "You can only call decode 'UTF-8' on a string that contains valid UTF-8. If it doesn't, for example because it is already converted to Perl characters, it will return undef."

If that is the case, I must either find a way to test the encoding of the current HTML page before calling Mojo::Util::decode('UTF-8', $page->text), or use some other technique to preserve the encoded HTML entities.
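One way to sidestep the undef case is to attempt a strict decode and fall back to the original string when it fails. The following is a minimal sketch using core Perl's Encode module rather than Mojo::Util; the helper name decode_if_needed is hypothetical, and the utf8::is_utf8 check is only a heuristic for "already decoded", not a guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return Perl characters whether $text arrives as
# UTF-8 bytes or as a string that has already been decoded.
sub decode_if_needed {
    my ($text) = @_;

    # Heuristic: a string carrying Perl's internal UTF-8 flag is assumed
    # to be decoded already, so it is returned unchanged.
    return $text if utf8::is_utf8($text);

    # Strict decode: croaks on invalid UTF-8 (caught by eval) and leaves
    # $text unmodified thanks to LEAVE_SRC.
    my $chars = eval {
        Encode::decode('UTF-8', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };

    # Fall back to the input when the bytes were not valid UTF-8.
    return defined $chars ? $chars : $text;
}

my $from_bytes = decode_if_needed("\xE2\x84\xA2");   # UTF-8 bytes of U+2122
my $from_chars = decode_if_needed("\x{2122}");       # already a character
printf "U+%04X U+%04X\n", ord $from_bytes, ord $from_chars;
# prints: U+2122 U+2122
```

Under that assumption, the result of decode_if_needed($page->text) could be passed to Mojo::DOM->new in place of the direct Mojo::Util::decode call.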

How do I most reliably preserve encoded HTML Entities when processing HTML documents with Mojo::DOM?

Dave Aiello asked Mar 12 '19 21:03


1 Answer

It looks like mapping to text replaces the XML entities with the characters they stand for, whereas working with the nodes and reading their content preserves the entities. This minimal example:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p>this &amp; &quot;that&quot;</p>');
for my $phrase ($dom->find('p')->each) {
    print $phrase->content(), "\n";
}

prints:

this &amp; &quot;that&quot;

If you want to keep your loop and map, replace map('text') with map('content') like this:

for my $phrase ($dom->find('p')->map('content')->each) {

If you have nested tags and want only the text (not the nested tag names, just their contents), you'll need to walk the DOM tree:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<p><i>this &amp; <b>&quot;</b><b>that</b><b>&quot;</b></i></p><p>done</p>');

for my $node (@{$dom->find('p')->to_array}) {
    print_content($node);
}

sub print_content {
    my ($node) = @_;
    if ($node->type eq "text") {
        print $node->content(), "\n";
    }
    if ($node->type eq "tag") {    
        for my $child ($node->child_nodes->each) {
            print_content($child);
        }
    }
}

which prints:

this & 
"
that
"
done
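Note that the entities in this last output come back as plain characters. As far as I can tell, content on an element re-escapes only the XML special characters (&, <, >, " and '), so named entities such as &trade; or &nbsp; would still end up as literal characters. If the translation file has to stay pure ASCII, one option is to rewrite every non-ASCII character as a numeric character reference before writing the phrase out. This is a different technique from the one above, sketched with core Perl only; the helper name to_numeric_entities is hypothetical, and it yields numeric references (&#8482;) rather than the original named ones (&trade;).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn every non-ASCII character of a decoded string
# into a numeric character reference, leaving ASCII untouched (so text
# already escaped as &amp; or &quot; passes through as is).
sub to_numeric_entities {
    my ($chars) = @_;
    $chars =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $chars;
}

# U+2122 is the trade mark sign, U+00A0 the no-break space.
print to_numeric_entities("Acme\x{2122}\x{A0}Widgets &amp; Co"), "\n";
# prints: Acme&#8482;&#160;Widgets &amp; Co
```

Applied to each phrase returned by map('content'), this would keep the XML entities that content already preserves and add references for everything else.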
Robert answered Oct 04 '22 23:10