I am trying to figure out the best way to use HTML::TreeBuilder in Perl to extract a few paragraphs of text from some HTML in an XML file.
I had it working using $tree->address (or so I thought) until I realized that not all entries are in the same order.
Without going through every single item in the list, it appears that each entry has several <div> elements, but only one of the <div>s has <p> elements in it. And none of the <div>s have classes, which would make this easy.
I have tried several different ways, and so far nothing seems to work for extracting the text that I want. I have looked at several different examples, but none of them are really close enough to what I am looking for.
It would be nice if something like this worked:
$bodyText = $tree->look_down( '_tag' => 'div' => 'p' );
But that gives me the error:
param list to look_down ends in a key!
Anyways, maybe someone can help point me in the right direction, I have been looking all night, and now my brain hurts.
Thanks!
John
With the vanilla form of HTML::TreeBuilder, this is best done using a code reference as a criterion of look_down. The subroutine will be called for each node in the tree that passes all previous criteria, and a node will be retained if the subroutine returns a true value.
This program shows its use. The anonymous subroutine uses grep to check the children of the node that is passed to it, counting all child elements that have a p tag. The array @divs then contains all div elements that have a p child element. You may want to ensure that @divs contains exactly one element.
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $doc = HTML::TreeBuilder->new_from_content(<<__HTML__);
<div>content</div>
<div>content</div>
<div><p>paragraph</p></div>
<div>content</div>
<div>content</div>
__HTML__
my @divs = $doc->look_down(
    _tag => 'div',
    sub { grep { ref eq 'HTML::Element' and $_->tag eq 'p' } $_[0]->content_list }
);
say scalar @divs, " found:\n";
say $divs[0]->as_HTML('<>&', ' ');
output
1 found:
<div>
<p>paragraph</div>
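Since the question asks for the paragraph text rather than the markup, here is a minimal sketch of how the check on @divs and the text extraction might look. The sample HTML and the @paragraphs variable are made up for illustration; only look_down, content_list, tag and as_text from HTML::TreeBuilder / HTML::Element are used.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Illustrative input: only one div has p children.
my $doc = HTML::TreeBuilder->new_from_content(<<'__HTML__');
<div>content</div>
<div><p>first</p><p>second</p></div>
<div>content</div>
__HTML__

# Same criteria as above: a div with at least one p child.
# Text children are plain strings, so "ref $_" skips them.
my @divs = $doc->look_down(
    _tag => 'div',
    sub { grep { ref $_ and $_->tag eq 'p' } $_[0]->content_list },
);

die "Expected exactly one div with p children, found " . scalar(@divs) . "\n"
    unless @divs == 1;

# Take the text of each p child of that div.
my @paragraphs = map  { $_->as_text }
                 grep { ref $_ and $_->tag eq 'p' } $divs[0]->content_list;

say for @paragraphs;

$doc->delete;
```

Dying when the count is not exactly one is just one choice; for real data you may prefer to warn and skip the entry.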
However, it is very much neater to employ the enhanced HTML::TreeBuilder::XPath, which allows the data to be addressed using XPath expressions. This allows look_down to be replaced with a findnodes call:
my @divs = $doc->findnodes('//div[p]');
and the result is identical to that of the code above.
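For completeness, a short sketch of a whole program built around that call, assuming HTML::TreeBuilder::XPath is installed from CPAN. The sample HTML and the variable names are illustrative only.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Illustrative input: only one div has p children.
my $doc = HTML::TreeBuilder::XPath->new_from_content(<<'__HTML__');
<div>content</div>
<div><p>first paragraph</p><p>second paragraph</p></div>
<div>content</div>
__HTML__

# '//div[p]' selects every div that has at least one p child.
my @divs = $doc->findnodes('//div[p]');
die "Expected exactly one div with p children\n" unless @divs == 1;

# A relative path picks out the p children of that div.
my @paragraphs = map { $_->as_text } $divs[0]->findnodes('./p');

say for @paragraphs;

$doc->delete;
```

If only the text is needed, the two steps could probably be collapsed into a single expression such as '//div[p]/p', at the cost of no longer checking how many div elements matched.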
Your error message makes sense. The look_down method expects a list of key/value pairs, just as you would use to build a hash. You are giving it three elements, so the last one is a key with no value. Keep in mind that => is also called the fat comma and is just a more readable way to write a comma (it also quotes a bareword on its left). It is a bit of an odd error message, though.
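To see why the list ends in a key, this small sketch builds the same argument list without calling look_down at all. The names @args and $dangling are made up for the illustration.

```perl
use strict;
use warnings;
use feature 'say';

# The fat comma separates list elements just like a plain comma,
# so this is a flat list of three strings, not a nested structure.
my @args = ( '_tag' => 'div' => 'p' );

say scalar @args;        # prints 3
say join ', ', @args;    # prints _tag, div, p

# Read as key/value pairs: '_tag' => 'div' is complete,
# but 'p' is left over as a key without a value.
my $dangling = @args % 2 ? $args[-1] : undef;
say "dangling key: $dangling" if defined $dangling;    # prints dangling key: p
```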
What you need to do is find the <div>s first, and then search each of those for <p>s. You cannot do it in one go with attribute/value criteria alone. The outer foreach gives you an HTML::Element object for each <div>. Have each of them look_down for <p>s.
use strict;
use warnings;
use feature qw( say );
use HTML::TreeBuilder 5 -weak;
my $tree = HTML::TreeBuilder->new_from_content(<DATA>);
foreach my $e ($tree->look_down(_tag => 'div')) {
    foreach my $f ($e->look_down(_tag => 'p')) {
        say $f->as_text;
    }
}
__DATA__
<html>
<body>
<div>foo</div>
<div><p>hello world</p></div>
<div>foo2</div>
<div>foo3</div>
<div><p>hello again</p></div>
</body>
</html>