My spider has a serious memory leak. After 15 minutes of running, its memory usage is at 5 GB, and prefs() reports roughly 900k live Request objects and little else. What could cause such a high number of live Request objects? The Request count only ever goes up and never comes down, while all other object counts stay close to zero.
My spider looks like this:
from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import LinkCrawlItem  # the project's item class

class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)
Here is the output of prefs():
HtmlResponse 2 oldest: 0s ago
ExternalLinkSpider 1 oldest: 3285s ago
LinkCrawlItem 2 oldest: 0s ago
Request 1663405 oldest: 3284s ago
Memory for 100k scraped pages can hit the 40 GB mark on some sites (for example, on victorinox.com it reaches 35 GB at the 100k scraped pages mark). On others it is much lower.
UPD.
There are a few possible issues I see right away.
Before starting, though, I wanted to mention that prefs() doesn't show the number of requests queued; it shows the number of Request() objects that are alive. It's possible to hold a reference to a request object and keep it alive even if it's no longer queued to be downloaded.
I don't really see anything in the code you've provided that would cause this, but you should keep it in mind.
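If you want to dig into that, prefs() is backed by scrapy.utils.trackref, so from the telnet console (or a throwaway debugging snippet) you can pull the oldest live Request and see which URLs are piling up. A minimal sketch, assuming the standard trackref helpers; 'youtube.com' below is just an illustrative host to check against:

    from scrapy.utils.trackref import get_oldest, iter_all, print_live_refs

    print_live_refs()                # the same report that prefs() prints in the telnet console
    oldest = get_oldest('Request')   # the longest-lived Request object, or None
    if oldest is not None:
        print(oldest.url)
    # Count live requests pointing at a suspect host (illustrative check)
    print(sum(1 for r in iter_all('Request') if 'youtube.com' in r.url))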
Right off the bat, I'd ask: are you using cookies? If not, sites that pass a session ID around as a GET variable will generate a new session ID for each page visit, so you'll essentially keep queuing up the same pages over and over again. For instance, victorinox.com will have something like "jsessionid=18537CBA2F198E3C1A5C9EE17B6C63AD" in its URL string, with the ID changing for every new page load.
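One way to test that theory, instead of (or in addition to) enabling cookies, is to normalize the session ID out of extracted URLs before they are queued, for example via the link extractor's process_value hook, so the duplicate filter can recognise revisits. A rough sketch, assuming the ID travels in the URL as a jsessionid path segment or query parameter (the regexes here are illustrative, not taken from your code):

    import re
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

    def strip_session_id(url):
        # Remove ';jsessionid=...' path segments and 'jsessionid=...' query parameters
        # so two visits to the same page produce the same URL for deduplication.
        url = re.sub(r';jsessionid=[^?#&]*', '', url, flags=re.IGNORECASE)
        url = re.sub(r'([?&])jsessionid=[^&#]*&?', r'\1', url, flags=re.IGNORECASE)
        return url.rstrip('?&')

    link_extractor = LxmlLinkExtractor(allow=(), process_value=strip_session_id)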
Second, you may find that you're hitting a spider trap: a page that keeps linking back to itself with an endless supply of new links. Think of a calendar with links to "next month" and "previous month". I'm not directly seeing one on victorinox.com, though.
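If you do suspect a trap, capping crawl depth is a cheap safeguard while you investigate. A minimal sketch using Scrapy's DEPTH_LIMIT setting (the limit of 5 is an arbitrary example):

    from scrapy.spiders import CrawlSpider

    class ExternalLinkSpider(CrawlSpider):
        name = 'external_link_spider'
        # Stop following links past a fixed depth so a self-referencing page
        # cannot queue new requests forever; tune the value to the sites crawled.
        custom_settings = {
            'DEPTH_LIMIT': 5,
            'DEPTH_STATS_VERBOSE': True,  # log request counts per depth to spot runaway branches
        }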
Third, from the provided code, your spider is not constrained to any specific domain. It will extract every link it finds on every page and run parse_obj on each one. The main page of victorinox.com, for instance, has a link to http://www.youtube.com/victorinoxswissarmy, which will in turn fill up your requests with tons of YouTube links.
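A hedged sketch of one way to rein that in: fill in allowed_domains so that OffsiteMiddleware drops external requests, and/or restrict the rule's link extractor with allow_domains, while parse_obj keeps reporting the external links it sees ('victorinox.com' below is just a stand-in for whatever site you're crawling):

    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ExternalLinkSpider(CrawlSpider):
        name = 'external_link_spider'
        allowed_domains = ['victorinox.com']        # example domain
        start_urls = ['http://www.victorinox.com/']

        # Only follow links that stay on the allowed domains; off-site links can
        # still be extracted and yielded as items in parse_obj, they just never
        # get scheduled as new requests.
        rules = (
            Rule(LxmlLinkExtractor(allow_domains=allowed_domains),
                 callback='parse_obj', follow=True),
        )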
You'll need to troubleshoot more to find out exactly what's going on, though.
Some strategies you may want to use:
If you find you're legitimately just generating too many requests and memory is an issue, enable the persistent job queue and save the requests to disk instead (see the sketch below). I'd recommend against this as a first step, though, as it's more likely your crawler isn't working as you wanted it to.
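For reference, the persistent queue is driven by the JOBDIR setting (commonly passed on the command line as -s JOBDIR=...); the scheduler then serializes pending requests to that directory instead of keeping them all in memory. A small sketch, with the directory name as an example only:

    from scrapy.spiders import CrawlSpider

    class ExternalLinkSpider(CrawlSpider):
        name = 'external_link_spider'
        # Spill the scheduler's pending requests to disk; this also lets the
        # crawl be paused and resumed. The path is an example, not a required name.
        custom_settings = {'JOBDIR': 'crawls/external_link_spider-1'}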