I'm just starting out in Perl, and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder to do most of the work, but I've run into some trouble. I have the following HTML:
<table class="winsTable">
<thead>...</thead>
<tbody>
<tr>
<td class = "wins">15</td>
</tr>
</tbody>
</table>
I know there are some modules that get data from tables, but this is a special case; not all the data I want is in a table. So, I tried:
my $tree = HTML::TreeBuilder->new_from_url( $url );
my @data = $tree->find('td class = "wins"');
But @data returned empty. I know this method would work without the class name, because I've successfully parsed data with $tree->find('strong'). So, is there a module that can handle this type of HTML syntax? I scanned through the HTML::TreeBuilder documentation and didn't find anything that appeared to, but I could be wrong.
You could use the look_down method to find the specific tag and attributes you're looking for. It is defined in the HTML::Element module, which HTML::TreeBuilder inherits from, so it is available on your $tree object.
my $data = $tree->look_down(
    _tag  => 'td',
    class => 'wins'
);
print $data->content_list, "\n" if $data; # prints '15' using the given HTML
$data = $tree->look_down(
    _tag  => 'td',
    class => 'losses'
);
print $data->content_list, "\n" if $data; # prints nothing using the given HTML
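For reference, here is a minimal self-contained sketch of the same idea. It assumes the HTML is held in a string instead of being fetched from a URL; the $html variable, the header cell, and the second row are made up for illustration. As far as I can tell, the original call fails because find is only an alias for find_by_tag_name, so the whole string 'td class = "wins"' is treated as a tag name that never occurs in the document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modelled on the question. The header cell and the second
# row are assumptions, added to show that look_down can return several matches.
my $html = <<'HTML';
<table class="winsTable">
<thead><tr><th>Wins</th></tr></thead>
<tbody>
<tr><td class="wins">15</td></tr>
<tr><td class="wins">7</td></tr>
</tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element;
# in scalar context it returns only the first one.
my @cells = $tree->look_down( _tag => 'td', class => 'wins' );

# as_text collapses each element to its text content.
my @wins = map { $_->as_text } @cells;

print "$_\n" for @wins;

$tree->delete;    # release the tree once the values are extracted
```

Calling look_down in list context is probably what you want when a page has more than one matching cell; the scalar form shown above only ever gives you the first.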
I use the excellent (though sometimes a bit slow) HTML::TreeBuilder::XPath module:
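use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
# findvalues returns the text of every node matched by the XPath expression
my @data = $tree->findvalues('//table[@class = "winsTable"]//td[@class = "wins"]');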
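If you want to try the XPath approach without a live page, a rough standalone sketch might look like the following. It parses a literal string in place of $mech->content(); the $html variable and the header cell are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup modelled on the question; the header cell is an assumption.
my $html = <<'HTML';
<table class="winsTable">
<thead><tr><th>Wins</th></tr></thead>
<tbody>
<tr><td class="wins">15</td></tr>
</tbody>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Restrict the search to cells with class "wins" inside the winsTable table.
my @wins = $tree->findvalues('//table[@class="winsTable"]//td[@class="wins"]');

print "$_\n" for @wins;

$tree->delete;    # release the tree once the values are extracted
```

Because the data you want is not all inside tables, the same findvalues call should work with any other XPath expression, for example one that targets a div or span by class.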