HTML Treebuilder XPath to Extract Links

Question

I am writing a basic script that extracts all the links from a web page. It is written in Perl and uses the WWW::Mechanize and HTML::TreeBuilder::XPath modules, both of which I installed from CPAN.

I know this can be done easily with WWW::Mechanize alone, but I would like to learn how to do it with XPath as well.

The script should parse the entire web page, check the href attribute of every anchor tag, extract the link, and print it to the console or write it to a file. Please note that the script below does not use strict, since I am only writing it to understand how XPath is used to traverse the HTML tree.

Here is the script:

#! /usr/bin/perl

use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use warnings;

$url="https://example.com";

$mech=WWW::Mechanize->new();
$mech->get($url);

$tree=HTML::TreeBuilder::XPath->new();

$tree->parse($mech->content);

$nodes=$tree->findnodes(q{'//a'}); # line is modified later.

foreach $node($nodes)
{
    print $node->attr('href');
}

And it gives an error:

Can't locate object method "attr" via package "XML::XPathEngine::Literal" at pagegetter.pl line 23.

I have modified the script as follows:

$nodes=$tree->findnodes(q{'//a/@href'});

while($node=$nodes->shift)
{
  print $node->attr('href');
}

Error:

Can't locate object method "shift" via package "XML::XPathEngine::Literal"

I am not sure how to print the value of the href attribute.

Should $nodes hold the list of all the href attributes? I believe it does not store the values themselves, but pointers to them?

I searched and read examples, but I am still not sure how to go about it.

Thanks.

daxim · Accepted Answer

There are a couple of mistakes. Repairs:

# list context: findnodes returns the matching nodes as a list
my @nodes = $tree->findnodes(
    q{//a}       # just a string, not a string containing quotes
);

# iterate over array
for my $node (@nodes) {
    print $node->attr('href'), "\n";
}
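
For reference, here is a self-contained sketch of both ways to get at the href values. It is not part of daxim's answer: the sample markup is made up and parsed from a string instead of being fetched with WWW::Mechanize, and the variable names are placeholders. It illustrates three points. The quoted expression q{'//a'} is evaluated as an XPath string literal rather than a path, which is where the XML::XPathEngine::Literal in the error messages comes from. The path //a yields element nodes whose href can be read with attr. The path //a/@href selects the attribute nodes themselves, and findvalues returns their string values directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample markup, used here in place of a fetched page.
my $html = <<'HTML';
<html><body>
<a href="https://example.com/one">One</a>
<a href="/two">Two</a>
<a name="anchor-without-href">No link</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is complete

# With quotes inside the string, the expression is an XPath string
# literal, so the result is not a node set.
my $literal = $tree->findnodes(q{'//a'});
print ref($literal), "\n";

# Approach 1: select the <a> elements, then read the attribute from each.
my @from_elements;
for my $node ( $tree->findnodes(q{//a}) ) {
    my $href = $node->attr('href');
    push @from_elements, $href if defined $href;    # skip anchors without href
}

# Approach 2: select the attribute nodes and take their string values.
my @from_attributes = $tree->findvalues(q{//a/@href});

print "$_\n" for @from_elements;

$tree->delete;    # release the tree's memory
```

Approach 1 is the closest match to the repaired loop above. Approach 2 avoids the check for anchors that have no href, because the XPath expression only matches attributes that exist.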

Tags: html, perl, xpath, html-tree

Asked by Neon Flash · 1 answer (daxim)