
Locating a div that contains paragraphs in Perl with HTML::TreeBuilder

I am trying to figure out the best way to use HTML::TreeBuilder in Perl to extract a few paragraphs of text from some HTML in an XML file.

I had it working using $tree->address (or so I thought) until I realized that not all entries are in the same order.

Without going through every single item in the list, it appears that each entry has several <div> elements, but only one of them contains <p> elements. None of the <div> elements have classes, which would have made this easy.

I have tried several different approaches, and so far nothing lets me extract the text that I want. I have looked at several examples, but none of them are close enough to what I am looking for.

It would be nice if something like this worked:

$bodyText = $tree->look_down( '_tag' => 'div' => 'p' );

But that gives me the error:

param list to look_down ends in a key!

Anyway, maybe someone can point me in the right direction. I have been looking all night, and now my brain hurts.

Thanks!

John

John B asked Dec 13 '25 18:12


2 Answers

With the vanilla form of HTML::TreeBuilder, this is best done using a code reference as a criterion of look_down. The subroutine will be called for each node in the tree that passes all previous criteria, and a node will be retained if the subroutine returns a true value.

This program shows its use. The anonymous subroutine uses grep to check the children of the node that is passed to it, counting all child elements that have a p tag. The ref test skips text children, which content_list returns as plain strings rather than objects. The array @divs then contains all div elements that have a p child element. You may want to ensure that @divs contains exactly one element.

use strict;
use warnings;

use feature 'say';

use HTML::TreeBuilder;

my $doc = HTML::TreeBuilder->new_from_content(<<__HTML__);
<div>content</div>
<div>content</div>
<div><p>paragraph</p></div>
<div>content</div>
<div>content</div>
__HTML__

my @divs = $doc->look_down(
  _tag => 'div', 
  sub { grep { ref eq 'HTML::Element' and  $_->tag eq 'p' } $_[0]->content_list }
);

say scalar @divs, " found:\n";
say $divs[0]->as_HTML('<>&', '  ');

output

1 found:

<div>
  <p>paragraph</div>
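
The "exactly one element" check mentioned above could look roughly like the sketch below. The sample HTML and the variable name @paragraphs are made up for illustration, and the ref test is loosened to `ref $_` on the assumption that any object in the content list is an element.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder;

# Hypothetical sample input: only the second div holds paragraphs.
my $doc = HTML::TreeBuilder->new_from_content(<<'__HTML__');
<div>content</div>
<div><p>first paragraph</p><p>second paragraph</p></div>
<div>content</div>
__HTML__

# Keep only div elements with at least one direct p child.
# Text children are plain strings, so "ref $_" filters them out.
my @divs = $doc->look_down(
    _tag => 'div',
    sub { grep { ref $_ and $_->tag eq 'p' } $_[0]->content_list },
);

# Fail loudly when the "exactly one div" assumption does not hold.
die 'expected exactly one div with paragraphs, found ' . scalar(@divs) . "\n"
    unless @divs == 1;

# Collect the text of each paragraph inside that div.
my @paragraphs = map { $_->as_text } $divs[0]->look_down( _tag => 'p' );

say for @paragraphs;
```

Dying on an unexpected count is only one option; skipping the entry or logging a warning may suit a batch job better.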

However, it is very much neater to employ the enhanced HTML::TreeBuilder::XPath, which allows the data to be addressed using XPath expressions. This allows look_down to be replaced with a findnodes call:

my @divs = $doc->findnodes('//div[p]');

and the result is identical to that of the code above.
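
For completeness, here is a minimal self-contained sketch of the XPath variant. It assumes HTML::TreeBuilder::XPath is installed from CPAN; the sample HTML and the second expression, //div/p, are illustrative additions rather than part of the program above.

```perl
use strict;
use warnings;
use feature 'say';

use HTML::TreeBuilder::XPath;

# Hypothetical sample input: only the second div holds paragraphs.
my $doc = HTML::TreeBuilder::XPath->new_from_content(<<'__HTML__');
<div>content</div>
<div><p>first paragraph</p><p>second paragraph</p></div>
<div>content</div>
__HTML__

# //div[p] selects every div that has at least one p child;
# //div/p selects those p children directly.
my @divs       = $doc->findnodes('//div[p]');
my @paragraphs = map { $_->as_text } $doc->findnodes('//div/p');

say scalar(@divs), ' div(s) with a p child';
say for @paragraphs;
```

Note that the tree has to be built with HTML::TreeBuilder::XPath rather than plain HTML::TreeBuilder for findnodes to be available.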

Borodin answered Dec 15 '25 08:12


Your error message makes sense. Apart from code references, the look_down method expects its criteria as key/value pairs, passed as a flat list. You are giving it three elements, so the last one is a key with no value. Keep in mind that => is also called the fat comma and is just a more readable way to write a comma (it also quotes a bareword on its left). It is a bit of an odd error message, though.
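
A small sketch in plain Perl, with no HTML modules involved, shows why the argument list from the question ends in a key. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# The fat comma (=>) is just a comma that also quotes a bareword on its
# left, so these two lists are identical.
my @with_fat_commas = ( '_tag' => 'div' => 'p' );
my @with_commas     = ( '_tag', 'div', 'p' );

# Three elements: one complete pair (_tag, div) plus a dangling key (p).
my $count    = scalar @with_fat_commas;
my $dangling = $count % 2 ? $with_fat_commas[-1] : undef;

print "elements: $count\n";
print "dangling key: $dangling\n" if defined $dangling;
```

The odd count is what look_down trips over: after consuming _tag and div as a pair, it finds p with no value to go with it.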

What you need to do is find the <div>s first, and then search each of those for <p>s. You cannot do it in one go with attribute criteria alone. The outer foreach gives you an HTML::Element object for each <div>; have each of them look_down for <p>s.

use strict;
use warnings;
use feature qw( say );
use HTML::TreeBuilder 5 -weak;

my $tree = HTML::TreeBuilder->new_from_content(<DATA>);
foreach my $e ($tree->look_down(_tag => 'div')) {
  foreach my $f ($e->look_down(_tag => 'p')) {
    say $f->as_text;
  }
}

__DATA__
<html>
<body>
<div>foo</div>
<div><p>hello world</p></div>
<div>foo2</div>
<div>foo3</div>
<div><p>hello again</p></div>
</body>
</html>

simbabque answered Dec 15 '25 08:12


