Handling 404 and internal server errors with Perl WWW::Mechanize

I am using WWW::Mechanize to crawl sites, and it works great, except that sometimes it hits a page that returns a 404 or 500 status code (Not Found or Internal Server Error), and then my script just exits. This is really messing with my data collection. Is there any way to have WWW::Mechanize let me catch these errors and see which status code was returned (e.g. 404, 500)? Thanks for the help!

srchulo asked Jan 17 '23 11:01


2 Answers

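You need to disable autocheck. When it is enabled (the default), WWW::Mechanize dies on any failed request, which is why your script exits: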

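use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 stops Mechanize from dying on failed requests
my $mech = WWW::Mechanize->new( autocheck => 0 );

$mech->get("http://somedomain.com");

if ( $mech->success() ) {
    ...
}
else {
    # status() returns the HTTP status code, e.g. 404 or 500
    print "status is: " . $mech->status . "\n";
}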

Also, as an aside, have a look at WWW::Mechanize::Cached::GZip and WWW::Mechanize::Cached to speed up your development when testing your mech scripts.
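
For a crawler, the autocheck setting above could be combined with a loop that records each failure and moves on to the next URL. This is only a minimal sketch: it assumes the URLs are passed on the command line, and process_page() is a hypothetical placeholder for your own page handling, not part of WWW::Mechanize.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Assumption: URLs come from the command line; with none given the loop does nothing.
my @urls = @ARGV;

# autocheck => 0: failed requests no longer die, so the loop keeps going.
my $mech = WWW::Mechanize->new( autocheck => 0 );

my %failed;    # url => HTTP status code of the failed request

for my $url (@urls) {
    $mech->get($url);

    if ( $mech->success ) {
        # process_page() is a hypothetical placeholder for your own code.
        process_page( $url, $mech->content );
    }
    else {
        # Remember the status code and carry on with the next URL.
        $failed{$url} = $mech->status;
        warn "skipping $url: status " . $mech->status . "\n";
    }
}

printf "%d of %d requests failed\n", scalar( keys %failed ), scalar @urls;

sub process_page {
    my ( $url, $content ) = @_;
    printf "%s: fetched %d characters\n", $url, length $content;
}
```

After the loop, %failed maps every URL that could not be fetched to its status code, so the failures can be retried or logged separately from the collected data.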

oalders answered Jan 30 '23 23:01


Turn off autocheck and manually check status(), which returns the HTTP status code of the response.

This is a 3-digit number like 200 for OK, 404 for Not Found, and so on.

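use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://...';
# With autocheck off, a failed request no longer kills the script
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);

# Print the HTTP status code of the response, e.g. 200 or 404
print $mech->status(), "\n";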

See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for Status Code Definitions.

If the status code is 400 or above, the request failed: 4xx codes are client errors and 5xx codes are server errors.
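
To turn the numeric code into something readable in a log, the HTTP::Status module (installed alongside LWP) could be used as sketched below. The helper name describe_status() is made up for this illustration; is_client_error(), is_server_error() and status_message() are the HTTP::Status functions it relies on.

```perl
use strict;
use warnings;
use HTTP::Status qw(is_client_error is_server_error status_message);

# describe_status() is a hypothetical helper, not part of WWW::Mechanize.
sub describe_status {
    my ($code) = @_;

    # 4xx means the request was at fault, 5xx means the server was.
    my $class = is_client_error($code) ? 'client error'
              : is_server_error($code) ? 'server error'
              :                          'not an error';

    # status_message() returns undef for codes it does not know.
    my $text = status_message($code) // 'unknown';

    return "$code $text ($class)";
}

print describe_status($_), "\n" for 200, 404, 500;
```

In a script like the one above, you would call describe_status( $mech->status ) after each get().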

Ωmega answered Jan 30 '23 23:01