Memory leak using WWW::Mechanize

I have this Perl script, and it gets an "Out of memory" error after a few minutes of running. I can't see any circular references and I can't work out why it is happening.

use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;
use utf8;

$url = "some url";

my $mech = new WWW::Mechanize;
$mech->get($url);
my $html = HTML::TreeBuilder::XPath->new_from_content($mech->content);
my $html2;

do { 
    for $item ($html->findnodes('//li[@class="dataset-item"]'))
    {
        my $title = $item->findvalue('normalize-space(.//a[2])');
        next unless $title =~ /environmental impact statement/i;        
        my $link = $item->findvalue('.//a[2]/@href');
        $mech->get($link);
        $html2 = HTML::TreeBuilder::XPath->new_from_content($mech->content);
        my @pdflinks = $html2->findvalues('//a[@title="Go to external URL"]/@href');
        my $date = $html2->findvalue('//tr[th="Date Created"]/td');
        for $pdflink (@pdflinks)
        {
            next unless $pdflink =~ /\.pdf$/;
            $mech->get($pdflink);
            $mech->save_content($filename = $mech->response->filename);
            say "Title: $title\nDate: $date\nFilename: $filename\n";
        }
    }
    if ($nextpage = $html->findvalue('//ul[@class="pagination"]/li/a[.="»"]/@href'))
    {
        say "Next Page: $nextpage\n";
        $mech->get("some site" . $nextpage);
        $html = HTML::TreeBuilder::XPath->new_from_content($mech->content);
    }
} while ($nextpage);

say "Completed.";
asked Mar 02 '23 by CJ7


1 Answer

WWW::Mechanize by default has its user agent keep the full history of all downloaded pages. From the documentation:

  • stack_depth => $value

Sets the depth of the page stack that keeps track of all the downloaded pages. Default is effectively infinite stack size. If the stack is eating up your memory, then set this to a smaller number, say 5 or 10. Setting this to zero means Mech will keep no history.

Thus the object keeps growing. Using Devel::Size qw(total_size) I tracked the size of $mech and saw it grow by tens of kB after each PDF. The script apparently gets a lot of matches; I quit my test when it had gobbled up 10% of memory (and had saved many dozens of files, over a GB, on disk).
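For reference, this is roughly how that growth can be observed; a minimal, self-contained sketch using a placeholder URL:

use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use Devel::Size qw(total_size);

my $mech = WWW::Mechanize->new;
for my $i (1 .. 5) {
    # every get() pushes another page onto the object's history stack
    $mech->get('https://example.com/');
    say "After request $i: ", total_size($mech), " bytes";
}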

One solution then is to instantiate a new object for, say, each $item, as sketched below. That is wasteful in principle, but in practice it adds little overhead while it caps the maximum size.
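A minimal sketch of that approach, keeping the loop structure from the question (processing details elided):

for my $item ($html->findnodes('//li[@class="dataset-item"]')) {
    # a fresh object per item; its page history is freed when it
    # goes out of scope at the end of the iteration
    my $inner_mech = WWW::Mechanize->new;
    my $link = $item->findvalue('.//a[2]/@href');
    $inner_mech->get($link);
    # ... process $inner_mech->content as before ...
}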

Or reset the object, or indeed limit its stack depth, as shown below. Since the code doesn't seem to need to go back to previous pages at all, there is really no need for any stack, so your solution of dropping it is quite fine.
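Limiting or disabling the stack only takes the constructor option (or the accessor on an existing object):

# keep no history at all
my $mech = WWW::Mechanize->new( stack_depth => 0 );

# or keep only the last few pages, if back() is ever needed
my $mech_small = WWW::Mechanize->new( stack_depth => 5 );

# the depth can also be changed on an existing object
$mech->stack_depth(0);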

Comments

  • To be precise, there is no "leak" in the script; it just takes more and more memory

  • Always have use strict; and use warnings; at the top of a script

  • It's better not to use the indirect object syntax to instantiate an object (new Package), but rather a normal method call (Package->new), to avoid dealing with ambiguities in some cases (see the example below). See the explanation in the docs and on this page, and examples of trouble in this post and this post.
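Applied to the constructor call in the question, that would be:

# indirect object syntax, best avoided:
#   my $mech = new WWW::Mechanize;

# plain method call, parsed unambiguously:
my $mech = WWW::Mechanize->new;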

answered Mar 05 '23 by zdim