Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to limit scrapy request objects?

So I have a spider that I thought was leaking memory, turns out it is just grabbing too many links from link rich pages (sometimes it puts upwards of 100,000) when I check the telnet console >>> prefs()

Now I have been over the docs and google again and again and I can't find a way to limit the requests that the spider takes in. What I want is to be able to tell it to hold back on taking requests once a certain amount goes into the scheduler. I have tried setting a DEPTH_LIMIT but that only lets it grab a large amount and then run the callback on the ones that it has grabbed.

It seems like a fairly straightforward thing to do and I am sure people have run into this problem before, so I know there must be a way to get it done. Any ideas?

EDIT: Here is the output from MEMUSAGE_ENABLE = True

     {'downloader/request_bytes': 105716,
     'downloader/request_count': 315,
     'downloader/request_method_count/GET': 315,
     'downloader/response_bytes': 10066538,
     'downloader/response_count': 315,
     'downloader/response_status_count/200': 313,
     'downloader/response_status_count/301': 1,
     'downloader/response_status_count/302': 1,
     'dupefilter/filtered': 32444,
     'finish_reason': 'memusage_exceeded',
     'finish_time': datetime.datetime(2015, 1, 14, 14, 2, 38, 134402),
     'item_scraped_count': 312,
     'log_count/DEBUG': 946,
     'log_count/ERROR': 2,
     'log_count/INFO': 9,
     'memdebug/gc_garbage_count': 0,
     'memdebug/live_refs/EnglishWikiSpider': 1,
     'memdebug/live_refs/Request': 70194,
     'memusage/limit_notified': 1,
     'memusage/limit_reached': 1,
     'memusage/max': 422600704,
     'memusage/startup': 34791424,
     'offsite/domains': 316,
     'offsite/filtered': 18172,
     'request_depth_max': 3,
     'response_received_count': 313,
     'scheduler/dequeued': 315,
     'scheduler/dequeued/memory': 315,
     'scheduler/enqueued': 70508,
     'scheduler/enqueued/memory': 70508,
     'start_time': datetime.datetime(2015, 1, 14, 14, 1, 31, 988254)}
like image 323
Joff Avatar asked Jun 14 '26 19:06

Joff


1 Answers

I solved my problem, the answer was really hard to track down so I posted it here in case anyone else comes across the same problem.

After sifting through scrapy code and referring back to the docs, I could see that scrapy kept all requests in memory, I already deduced that, but in the code there is also some checks to see if there is a job directory in which to write pending requests to disk (in core.scheduler)

So, if you run the scrapy spider with a job directory, it will write pending requests to disk and then retrieve them from disk instead of storing them all in memory.

$ scrapy crawl spider -s JOBDIR=somedirname

when I do this, if I enter the telnet console, I can see that my number of requests in memory is always about 25, and I have 100,000+ written to disk, exactly how I wanted it to run.

It seems like this would be a common problem, given that one would be crawling a large site that has multiple extractable links for every page. I am surprised it is not more documented or easier to find.

http://doc.scrapy.org/en/latest/topics/jobs.html the scrapy site there states that the main purpose is for pausing and resuming later, but it works this way as well.

like image 91
Joff Avatar answered Jun 17 '26 09:06

Joff



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!