How can I get the ultimate URL without fetching the pages using Perl and LWP?

Tags: redirect, perl, lwp

I'm doing some web scraping using Perl's LWP. I need to process a set of URLs, some of which may redirect (one or more times).

How can I get the ultimate URL, with all redirects resolved, using the HEAD method?

planetp asked Mar 18 '10 13:03

2 Answers

If you use the fully featured LWP::UserAgent, the response it returns is an instance of HTTP::Response, which in turn holds an HTTP::Request as an attribute. Note that this is NOT necessarily the same HTTP::Request that you created with the original URL from your set of URLs, as the HTTP::Response documentation explains for the method that retrieves the request instance from the response instance:

$r->request( $request )

This is used to get/set the request attribute. The request attribute is a reference to the request that caused this response. It does not have to be the same request passed to the $ua->request() method, because there might have been redirects and authorization retries in between.

Once you have the request object, you can use the uri method to get the URI. If redirects were used, the URI is the result of following the chain of redirects.

Here's a Perl script, tested and verified, that gives you the skeleton of what you need:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

my $ua;  # Instance of LWP::UserAgent
my $req; # Instance of (original) request
my $res; # Instance of HTTP::Response returned via request method

$ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);

$req = HTTP::Request->new(HEAD => 'http://www.ecu.edu/wllc');
$req->header('Accept' => 'text/html');

$res = $ua->request($req);

if ($res->is_success) {
    # Chained method calls; you probably want to test that
    # $res->request is defined before calling uri() on it.
    # This is the inline version of:
    #   my $finalrequest = $res->request();
    #   print "Final URI = " . $finalrequest->uri() . "\n";
    print "Final URI = " . $res->request()->uri() . "\n";
} else {
    print "Error: " . $res->status_line . "\n";
}
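
If you also want to see each hop and not only the final URI, the same response object can be walked backwards. The following is a minimal sketch, not part of the tested script above: redirect_chain is a hypothetical helper name, and it relies on the redirects method of HTTP::Response, which returns the earlier responses in the previous chain, oldest first. That chain can also contain authorization retries, so treat the result as "every request that was made" rather than strictly redirects. URLs are taken from the command line; with no arguments the script does nothing.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: return every URI that was requested on the way
# to $res, oldest first, ending with the final URI.
sub redirect_chain {
    my ($res) = @_;

    # redirects() lists the intermediate responses (oldest first); each
    # of them remembers the request that produced it.
    my @hops = map { $_->request->uri->as_string } $res->redirects;

    return (@hops, $res->request->uri->as_string);
}

my $ua = LWP::UserAgent->new;

for my $url (@ARGV) {
    my $res = $ua->head($url);
    if ($res->is_success) {
        print join(" -> ", redirect_chain($res)), "\n";
    } else {
        print "Error for $url: " . $res->status_line . "\n";
    }
}
```

Run it as, for example, perl chain.pl http://www.ecu.edu/wllc (the script name is an assumption).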
Tony Miller answered Oct 23 '22 23:10

As stated in perldoc LWP::UserAgent, the default is to follow redirects for GET and HEAD requests:

$ua = LWP::UserAgent->new( %options )

...
       KEY                     DEFAULT
       -----------             --------------------
       max_redirect            7
       ...
       requests_redirectable   ['GET', 'HEAD']

Here is an example:

#!/usr/bin/perl

use strict; use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->show_progress(1);

my $response = $ua->head('http://unur.com/');

if ( $response->is_success ) {
    print $response->request->uri->as_string, "\n";
}

Output:

** HEAD http://unur.com/ ==> 301 Moved Permanently (1s)
** HEAD http://www.unur.com/ ==> 200 OK
http://www.unur.com/
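
Since the question is about a whole set of URLs, the same idea can be wrapped in a small helper. This is only a sketch under assumptions: resolve_final_url is a hypothetical name, and the max_redirect and timeout values are illustrative choices, not recommendations. It returns undef when the HEAD request does not end in a successful response, which includes the case where the redirect limit is exceeded.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;

# Hypothetical helper: issue a HEAD request and return the final URI as
# a string, or undef if the request did not end in success.
sub resolve_final_url {
    my ($ua, $url) = @_;

    my $response = $ua->head($url);
    return undef unless $response->is_success;

    return $response->request->uri->as_string;
}

my $ua = LWP::UserAgent->new(
    max_redirect => 5,     # assumed limit; the documented default is 7
    timeout      => 10,    # assumed timeout in seconds
);

# URLs come from the command line; nothing happens without arguments.
for my $url (@ARGV) {
    my $final = resolve_final_url($ua, $url);
    print "$url => ", (defined $final ? $final : "unresolved"), "\n";
}
```

Passing the user agent in as an argument keeps one set of options for every URL and makes the helper easy to exercise with a differently configured agent.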
Sinan Ünür answered Oct 23 '22 23:10