I'm trying to parse an HTML document for a web indexing program. To do this I'm using HTML::TokeParser.
I'm getting an error on the last line of my first if statement:
 if ( $token->[1] eq 'a' ) {
     #href attribute of tag A
     my $suffix = $token->[2]{href};
that says Can't use string ("<./a>") as a HASH ref while "strict refs" in use at ./indexer.pl line 270, <PAGE_DIR> line 1.
Is my problem that something (the suffix? or <./a>?) is a string and needs to be turned into a hash ref? I looked at other posts with similar errors, but I'm still not sure what is going on. Thanks for any help.
sub parse_document {
    #passed from input
    my $html_filename = $_[0];
    #base url for links
    my $base_url = $_[1];
    #created to hold tokens
    my @tokens = ();
    #created for doc links
    my @links = ();
    #creates parser
    my $p = HTML::TokeParser->new($html_filename);
    #loops through doc tags
    while (my $token = $p->get_token()) {
        #code for retrieving links
        if ( $token->[1] eq 'a' ) {
            # href attribute of tag A
            my $suffix = $token->[2]{href};
            #if href exists & isn't an email link
            if ( defined($suffix) && !($suffix =~ "^mailto:") ) {
                #make the url absolute
                my $new_url = make_absolute_url $base_url, $suffix;
                #make sure it's of the http:// scheme
                if ($new_url =~ "^http://"){
                    #normalize the url
                    my $new_normalized_url = normalize_url $new_url;
                    #add it to links array
                    push(@links, $new_normalized_url);
                }
            }
        }
        #code for text words
        if ($token->[0] eq 'T') {
            my $text =  $token->[1];
            #add words to end of array
            #(split by non-letter chars)
            my @words = split(/\P{L}+/, $text);
        }
    }
    return (\@tokens, \@links);
}
The get_token() method returns an array reference in which $token->[2] is a hash reference of attributes (including your href) only if $token->[0] is 'S' (that is, a start tag). In this case, you are matching an end tag (where $token->[0] is 'E'), and for end tags $token->[2] is the literal text of the tag. See the HTML::TokeParser documentation for details.
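To fix, add a
next if $token->[0] ne 'S';
at the top of your loop. Note that this also skips the 'T' text tokens handled by your second if; to keep that branch working, test $token->[0] eq 'S' as part of the link condition instead.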
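As a minimal sketch of the type check, assuming a made-up HTML snippet that is parsed from a string rather than a file, the loop could look like this. It tests the token type inside each condition instead of using next, so text tokens are still processed. The sample markup and the @words array are illustrative and not part of the original program.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup, made up for this illustration.
my $html = '<p>See <a href="/docs">the docs</a> or '
         . '<a href="mailto:me@example.com">mail me</a>.</p>';

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my (@links, @words);
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S' && $token->[1] eq 'a') {
        # Start tag: $token->[2] is the attribute hash reference.
        my $href = $token->[2]{href};
        push @links, $href if defined $href && $href !~ /^mailto:/;
    }
    elsif ($type eq 'T') {
        # Text token: $token->[1] is the text itself.
        push @words, grep { length } split /\P{L}+/, $token->[1];
    }
}

print "links: @links\n";
print "words: @words\n";
```

End tags such as </a> match neither branch, so $token->[2] is never dereferenced for them.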
For an end tag, $token->[2] is a string, not a hash reference.
Do a print $token->[2] and you'll see that it is a string containing </a>.
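To see this for yourself, a small diagnostic along the following lines reports what kind of value sits in $token->[2] for each token. It is only a sketch: the one-line HTML input and the @seen array are invented for the demonstration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Tiny made-up document: one start tag, one text token, one end tag.
my $html = '<a href="page.html">link</a>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @seen;
while (my $token = $p->get_token) {
    # ref() returns 'HASH' for a hash reference and '' for a plain scalar.
    my $kind = ref($token->[2]) || 'not a reference';
    push @seen, "$token->[0] $kind";
    print "type $token->[0]: third element is $kind\n";
}
```

Only the 'S' token reports HASH; for the 'E' token the third element is the plain string that triggered the "strict refs" error.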