Scrapy spider memory leak

My spider has a serious memory leak. After 15 minutes of running, its memory usage is at 5 GB, and Scrapy reports (using prefs()) that there are 900k live Request objects and little else. What could be the reason for this high number of live Request objects? The Request count only goes up and never goes down. All other object counts are close to zero.

My spider looks like this:

# Imports assume Scrapy 1.x-style module paths
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import HtmlResponse

# LinkCrawlItem is the project's own Item subclass, defined in its items module


class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)

Here is the output of prefs():

HtmlResponse                        2   oldest: 0s ago 
ExternalLinkSpider                  1   oldest: 3285s ago
LinkCrawlItem                       2   oldest: 0s ago
Request                        1663405   oldest: 3284s ago
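prefs() is invoked from Scrapy's telnet console while the spider is running (port 6023 is the default; adjust if your setup differs):

$ telnet localhost 6023
>>> prefs()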

Memory usage for 100k scraped pages can hit the 40 GB mark on some sites (for example, on victorinox.com it reaches 35 GB at the 100k-scraped-pages mark). On others it is much lower.

UPD.

Objgraph output for the oldest Request object after some time of running:

[image: objgraph back-reference graph for the oldest Request]

Aldarund asked Jul 23 '15 17:07


1 Answer

There are a few possible issues I see right away.

Before starting though, I wanted to mention that prefs() doesn't show the number of requests queued; it shows the number of Request() objects that are alive. It's possible to hold a reference to a request object and keep it alive even if it's no longer queued to be downloaded.

I don't really see anything in the code you've provided that would cause this, but you should keep it in mind.
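One way to see what is keeping those Request objects alive is to combine Scrapy's trackref utilities with objgraph (which, per your update, you already have installed). A minimal sketch, run from the telnet console or a shell attached to the running crawl; the output filename is just an example:

from scrapy.utils.trackref import get_oldest
import objgraph

# Grab the oldest live Request that Scrapy is tracking (the ~3284s-old one in the prefs() output).
oldest_request = get_oldest('Request')

# Dump a graph of everything holding a reference to it. Whatever shows up here
# (scheduler queues, your own lists/dicts, closures, ...) is what keeps it alive.
objgraph.show_backrefs([oldest_request], max_depth=3, filename='request_backrefs.png')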

Right off the bat, I'd ask: are you using cookies? If not, sites which pass around a session ID as a GET variable will generate a new session ID for each page visit. You'll essentially keep queuing up the same pages over and over again. For instance, victorinox.com will have something like "jsessionid=18537CBA2F198E3C1A5C9EE17B6C63AD" in its URL string, with the ID changing for every new page load.
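If that turns out to be the cause, two things help: keep cookies enabled (COOKIES_ENABLED is True by default), and strip the session ID out of extracted URLs so the duplicate filter can recognise repeat visits. A rough sketch using the link extractor's process_value hook; the jsessionid pattern is just the example from above, adjust it to whatever the site actually uses:

import re

from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


def strip_session_id(url):
    # Remove Java-style ";jsessionid=..." path parameters and
    # "jsessionid=..." query parameters before the URL becomes a Request.
    url = re.sub(r';jsessionid=[^?#&]*', '', url, flags=re.IGNORECASE)
    url = re.sub(r'([?&])jsessionid=[^&#]*&?', r'\1', url, flags=re.IGNORECASE)
    return url.rstrip('?&')


link_extractor = LxmlLinkExtractor(allow=(), process_value=strip_session_id)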

Second, you may find that you're hitting a spider trap. That is, a page which keeps linking back to itself with an endless supply of new links. Think of a calendar with a link to "next month" and "previous month". I'm not directly seeing any on victorinox.com, though.
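If you do spot a trap, the usual fix is a deny pattern on the link extractor; the patterns below are only placeholders for whatever the trap's URLs actually look like:

from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# Hypothetical deny patterns for calendar-style traps; adapt to the URLs you see in the logs.
trap_safe_extractor = LxmlLinkExtractor(deny=(r'/calendar/', r'[?&](month|year)='))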

Third, from the provided code, your spider is not constrained to any specific domain. It will extract every link it finds on every page, running parse_obj on each one. The main page of victorinox.com, for instance, has a link to http://www.youtube.com/victorinoxswissarmy. This will in turn fill up your request queue with tons of YouTube links.
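A sketch of what a constrained version could look like, using victorinox.com purely as an example domain: allowed_domains lets OffsiteMiddleware drop offsite requests, and allow_domains on the Rule's extractor keeps the crawl on-site, while parse_obj can still record the external links it finds.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    # Only requests to this domain are scheduled; OffsiteMiddleware drops the rest.
    allowed_domains = ['victorinox.com']
    start_urls = ['http://www.victorinox.com/']

    rules = (
        Rule(LxmlLinkExtractor(allow_domains=['victorinox.com']),
             callback='parse_obj', follow=True),
    )

    # parse_obj stays the same as in the question.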

You'll need to troubleshoot more to find out exactly what's going on, though.

Some strategies you may want to use:

  1. Create a new Downloader Middleware and log all of your requests (to a file, or database), then review the requests for odd behaviour; a sketch follows this list.
  2. Limit the depth (DEPTH_LIMIT) to prevent the crawl from continuing down the rabbit hole indefinitely; also shown below.
  3. Limit the crawl to a single domain to test whether the problem persists.
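For the first two points, here's a rough sketch of a request-logging downloader middleware together with the relevant settings; the module path and depth value are placeholders:

import logging

logger = logging.getLogger(__name__)


class RequestLoggerMiddleware(object):
    """Log every outgoing request so session-ID churn, spider traps and
    offsite URLs become visible in the crawl log."""

    def process_request(self, request, spider):
        logger.info('Scheduling %s (depth %s)', request.url, request.meta.get('depth'))
        return None  # let Scrapy continue handling the request normally


# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RequestLoggerMiddleware': 543,  # placeholder module path
}
DEPTH_LIMIT = 5  # point 2: stop following links past this depth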

If you find you're legitimately just generating too many requests and memory is an issue, enable the persistent job queue and save the requests to disk instead. I'd recommend against this as a first step, though, as it's more likely your crawler isn't working as you intended.
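Enabling the persistent (disk-based) request queues only requires pointing the crawl at a job directory; the directory name below is arbitrary:

scrapy crawl external_link_spider -s JOBDIR=crawls/external_link_spider-1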

Rejected answered Nov 03 '22 03:11