
How do I extract an HTML element based on its class?

I'm just starting out in Perl, and wrote a simple script to do some web scraping. I'm using WWW::Mechanize and HTML::TreeBuilder to do most of the work, but I've run into some trouble. I have the following HTML:

<table class="winsTable">
    <thead>...</thead>
    <tbody>
        <tr>
            <td class = "wins">15</td>
        </tr>
    </tbody>
</table>

I know there are some modules that get data from tables, but this is a special case; not all the data I want is in a table. So, I tried:

my $tree = HTML::TreeBuilder->new_from_url( $url );
my @data = $tree->find('td class = "wins"');

But @data came back empty. I know find works with a bare tag name, because I've successfully parsed data with $tree->find('strong'). So, is there a module that can select an element by its class like this? I scanned through the HTML::TreeBuilder documentation and didn't find anything that appeared to, but I could be wrong.

aquemini asked Jul 14 '13 03:07


2 Answers

You can use the look_down method to find the specific tag and attributes you're looking for. It is defined in HTML::Element, which HTML::TreeBuilder inherits from, so you can call it directly on $tree.

# In scalar context, look_down returns the first matching element (or undef).
my $data = $tree->look_down(
    _tag  => 'td',
    class => 'wins'
);

print $data->content_list, "\n" if $data;  # prints 15 for the HTML above

$data = $tree->look_down(
    _tag  => 'td',
    class => 'losses'
);

print $data->content_list, "\n" if $data;  # prints nothing: no td has class "losses"
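
If more than one cell can match, or the class attribute may hold several names, something along these lines might help. It is only a sketch: the second table row and the "highlight" class are made up for illustration and are not in the question's HTML. It relies on look_down returning all matches in list context and accepting a regex as an attribute criterion.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup based on the question; the second row and the "highlight"
# class are hypothetical, added only to show multiple matches.
my $html = <<'HTML';
<table class="winsTable">
    <tbody>
        <tr><td class="wins">15</td></tr>
        <tr><td class="wins highlight">9</td></tr>
    </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every matching element.
# A plain string criterion must equal the whole attribute value, so a regex
# is used here to match "wins" as one of several space-separated classes.
my @cells = $tree->look_down(
    _tag  => 'td',
    class => qr/(?:^|\s)wins(?:\s|$)/,
);

# as_text gives the text content of each element.
my @wins = map { $_->as_text } @cells;
print "$_\n" for @wins;

$tree->delete;  # release the tree explicitly (matters on older HTML::Element versions)
```

With a plain string such as class => 'wins', the second cell would not match, because its attribute value is "wins highlight".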
dms answered Oct 23 '22 19:10


I use the excellent (though sometimes a bit slow) HTML::TreeBuilder::XPath module:

my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
my @data = $tree->findvalues('//table[ @class = "winsTable" ]//td[@class = "wins"]');
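
For anyone who wants to try this without fetching a page, here is a rough self-contained version of the same idea. It parses a literal copy of the question's HTML instead of $mech->content(), and the variable names are my own. findvalues returns the text of each match, while findnodes returns the elements themselves.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# A literal copy of the question's table, used in place of a fetched page.
my $html = <<'HTML';
<table class="winsTable">
    <tbody>
        <tr><td class="wins">15</td></tr>
    </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalues returns the string value of every node the XPath selects.
my @values = $tree->findvalues('//table[@class="winsTable"]//td[@class="wins"]');

# findnodes returns the element objects, useful when attributes or
# child elements are needed rather than just the text.
my @nodes = $tree->findnodes('//td[@class="wins"]');

print "values: @values\n";
print "first node text: ", $nodes[0]->as_text, "\n" if @nodes;

$tree->delete;  # release the tree when finished
```

Note that @class="wins" compares against the whole attribute value, so a cell with several classes would need a different predicate.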
gangabass answered Oct 23 '22 17:10