
How to move up a node in html tree and extract the link?

I know the title is not very descriptive, so let me explain.

I am trying to parse the given HTML document using HTML::TreeBuilder. The values 5, 1, ABC, and DEF in this document have to be validated against user-supplied values, and if that validation succeeds I have to extract the href link.

My code so far:

my @tag = $tree->look_down( _tag => 'tr', class => qr{\bepeven\scompleted\b} );
for (@tag) {

    query_element($_);
}

sub query_element {

    my @td_tag = $_[0]->look_down( _tag => 'td' );

    my $num1 = shift @td_tag; #Get the first td tag
    my $num2 = shift @td_tag; # Get the second td tag


    #Making sure first/second td tag has numeric value
    $num1 = $1 if $num1->as_text =~ m!(\d+)! or die "no match found";
    $num2 = $1 if $num2->as_text =~ m!(\d+)! or die "no match found";


    #Validating that the above values match the user-provided values 5 and 1.
    if ( $num1 eq '5' && $num2 eq '1' ) { 
        say "hurray..!!";

        #Iterating over rest of the td tag to make sure we get the right link from it.
        for (@td_tag) {

            #Check if it contains ABC and then proceed to fetch the download href link.
            if ($_->look_down(_tag  => 'td', class => qr{[c]}, sub {
                        $_[0]->as_text eq 'ABC';} )
            )   
            {   
                my $text = $_->as_text;
                say "Current node text is: ", $text; #outputs ABC

                #Now from here how do I get the link I want to extract.
            }
        }
    }
}

My approach is to first extract the values from the td tags and match them against the user-specified values. If that succeeds, I look for another user-specified value, either ABC or DEF (ABC in my case), and only if that matches do I extract the link.

The tag containing ABC or DEF has no fixed position, but it always comes after the tags containing the values 5 and 1. So I used $_[0]->as_text eq 'ABC'; to check that the tag contains ABC. In the tree I am now at the node whose text is ABC. From here, how do I extract the link href, i.e. how do I move up the object tree and extract the value?

PS: I would have tried XPath here, but the position of the HTML elements is not that well defined or structured.

EDIT:

So, I tried $_->tag() and it returned td. But if I am on the td tag, then why doesn't the following code work?

my $link_obj = $_->look_down(_tag => 'a'); # It should look for the `a` tag.
say $link_obj->as_text;

But it gives the following error:

Can't call method "as_text" on an undefined value.
RanRag asked Sep 13 '12 07:09



2 Answers

I hope the following (using my own Marpa::R2::HTML) is helpful. Note that the HTML::TreeBuilder answer finds only one link. The code below finds two, which I think was the intention.

#!perl

use Marpa::R2::HTML qw(html);

use 5.010;
use strict;
use warnings;

my $answer = html(
    ( \join q{}, <DATA> ),
    {   td => sub { return Marpa::R2::HTML::contents() },
        a  => sub {
            my $href = Marpa::R2::HTML::attributes()->{href};
            return undef if not defined $href;
            return [ link => $href ];
        },
        'td.c' => sub {
            my @values = @{ Marpa::R2::HTML::values() };
            if ( ref $values[0] eq 'ARRAY' ) { return $values[0] }
            return [ test => 'OK' ] if Marpa::R2::HTML::contents eq 'ABC';
            return [ test => 'OK' ] if Marpa::R2::HTML::contents eq 'DEF';
            return [ test => '' ];
        },
        tr => sub {
            my @cells = @{ Marpa::R2::HTML::values() };
            return undef if shift @cells != 5;
            return undef if shift @cells != 1;
            my $ok = 0;
            my $link;
            for my $cell (@cells) {
                my ( $type, $value ) = @{$cell};
                $ok = 1 if $type eq 'test' and $value eq 'OK';
                $link = $value if $type eq 'link';
            }
            return $link if $ok;
            return undef;
        },
        ':TOP' => sub { return Marpa::R2::HTML::values(); }
    }
);

die "No parse" if not defined $answer;
say join "\n", @{$answer};

__DATA__
<table>
    <tbody>

        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">ABC</td>
            <td class="c">status</td>
            <td class="c"><a href="/path/link">Download</a></td>
        </tr>
        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">status</td>
            <td class="c">DEF</td>
            <td class="c"><a href="/path2/link">Download</a></td>
        </tr>


    </table>
Jeffrey Kegler answered Nov 27 '22 02:11



I'm not certain I understand what you're looking to do, but is it something along these lines? Use look_down to describe what you want. There's no need to navigate around the tree yourself; that's going to be fragile.

use strict;
use warnings;
use HTML::TreeBuilder 5 -weak;
use 5.014;

my $tree = HTML::TreeBuilder->new_from_content(<DATA>);


for my $e ($tree->look_down( _tag => 'a',
                             sub { my $e = $_[0];
                                   my $tr = $e->parent->parent; ### Could also use ->lineage to search up through the 
                                                                ### containing elements
                                   return unless $tr->tag eq 'tr' and ( $tr->attr('class') // '' ) eq 'epeven completed';
                                   return (     $tr->look_down( _tag => 'td', sub { $_[0]->as_text eq '1'; })
                                            and $tr->look_down( _tag => 'td', sub { $_[0]->as_text eq '5'; })
                                            and $tr->look_down( _tag => 'td', class => 'c', sub { $_[0]->as_text eq 'ABC'; })
                                          );
                                 }
                           )
          ) {
    say $e->attr('href');
}


__DATA__

<table>
    <tbody>

        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">ABC</td>
            <td class="c">status</td>
            <td class="c"><a href="/path/link">Download</a></td>
        </tr>
        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">status</td>
            <td class="c">DEF</td>
            <td class="c"><a href="/path2/link">Download</a></td>
        </tr>


    </table>

Output:

/path/link
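
If you do want to start from the ABC (or DEF) cell and work your way up, here is a rough sketch of how that could look. It also shows why the look_down in the question's EDIT returns undef: look_down only searches the element and its descendants, and the td holding ABC has no `a` inside it. The link lives in a sibling td, so you have to go up to the tr first (look_up searches the element and its ancestors) and then search down from there. The helper name links_for and its argument order are my own invention for illustration, not part of HTML::TreeBuilder, and I'm assuming the markup looks like the sample data above.

```perl
use strict;
use warnings;
use 5.010;
use HTML::TreeBuilder;
use Scalar::Util qw(refaddr);

# Hypothetical helper (not part of HTML::TreeBuilder): returns the href of the
# first link in every row whose first two cells equal $first and $second and
# which has a td.c cell whose text is one of @labels.
sub links_for {
    my ( $tree, $first, $second, @labels ) = @_;
    my %wanted = map { $_ => 1 } @labels;
    my ( @links, %seen );

    # Start at the cells we can identify by their text.
    my @cells = $tree->look_down(
        _tag  => 'td',
        class => qr/\bc\b/,
        sub { $wanted{ $_[0]->as_text } },
    );

    for my $cell (@cells) {
        # Move up: look_up checks the element itself and then its ancestors.
        my $row = $cell->look_up(
            _tag  => 'tr',
            class => qr/\bepeven\s+completed\b/,
        ) or next;
        next if $seen{ refaddr $row }++;    # one result per row

        # Validate the first two cells of that row.
        my @td = $row->look_down( _tag => 'td' );
        next unless @td >= 2;
        next unless $td[0]->as_text eq $first && $td[1]->as_text eq $second;

        # Search down from the row, not from the cell, to reach the sibling td.
        my $anchor = $row->look_down(
            _tag => 'a',
            sub { defined $_[0]->attr('href') },
        ) or next;
        push @links, $anchor->attr('href');
    }
    return @links;
}

# Same sample markup as above, inlined so the sketch is self-contained.
my $html = <<'HTML';
<table>
    <tbody>
        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">ABC</td>
            <td class="c">status</td>
            <td class="c"><a href="/path/link">Download</a></td>
        </tr>
        <tr class="epeven completed">
            <td>5</td>
            <td>1</td>
            <td class="c">status</td>
            <td class="c">DEF</td>
            <td class="c"><a href="/path2/link">Download</a></td>
        </tr>
    </tbody>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
say for links_for( $tree, '5', '1', 'ABC', 'DEF' );
$tree->delete;    # free the tree explicitly when not using -weak
```

With both ABC and DEF passed as labels this should report the link from each of the two rows; pass only 'ABC' to get the single link from the first row. The row check here uses the first two td elements by position, which matches the question's description but is an assumption about the real page.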
Oesor answered Nov 27 '22 00:11
