 

Scrapy, limit on start_urls

Tags: python, scrapy

I am wondering whether there is a limit on the number of start_urls I can assign to my spider. As far as I've searched, there doesn't seem to be any documentation on a limit for this list.

Currently I have set up my spider so that the list of start_urls is read in from a CSV file. The number of URLs is around 1,000,000.
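For reference, a setup like the one described might look roughly like the sketch below; the file name start_urls.csv, the single-column layout, and the load_urls helper are assumptions, not details from the question:

import csv

from scrapy import Spider


def load_urls(path):
    """Read one URL per row from the first column of a CSV file."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]


class MillionUrlSpider(Spider):
    name = "million_urls"
    # Roughly 1,000,000 plain strings end up in start_urls
    start_urls = load_urls("start_urls.csv")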

Asked Oct 23 '25 by Taku

1 Answer

There isn't a hard limit per se, but you will probably want to impose one yourself, otherwise you might run into memory problems.
All 1,000,000 URLs get handed to Scrapy's scheduler, and since Request objects are considerably heavier than plain strings, you can end up running out of memory.

To avoid this, you can feed your start URLs in batches using the spider_idle signal:

import logging

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(Spider):
    name = "spider"
    batch_size = 10000

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        # Re-fill the scheduler with a new batch whenever the spider runs idle
        crawler.signals.connect(spider.idle_consume, signal=signals.spider_idle)
        return spider

    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        self.urls = []  # read from file

    def start_requests(self):
        # Only turn batch_size urls into Request objects at a time
        for _ in range(min(self.batch_size, len(self.urls))):
            url = self.urls.pop(0)
            yield Request(url)

    def parse(self, response):
        pass
        # parse

    def idle_consume(self):
        """
        Every time the spider is about to close, check whether the url
        buffer still has something left to crawl.
        """
        if not self.urls:
            return
        logging.info('Consuming batch')
        for req in self.start_requests():
            # on newer Scrapy versions this may be self.crawler.engine.crawl(req)
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
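To tie this back to the question, the "# read from file" placeholder could be filled by loading the CSV once into plain strings, as in the sketch below (the subclass name and the single-column start_urls.csv are assumptions). Keeping bare strings in self.urls is cheap; only batch_size of them are turned into Request objects at any one time:

import csv


class CsvBatchSpider(MySpider):
    """Same batching logic as above, with self.urls filled from a CSV file."""
    name = "csv_batch_spider"
    csv_path = "start_urls.csv"  # hypothetical file, one URL per row

    def __init__(self, crawler, *args, **kwargs):
        super().__init__(crawler, *args, **kwargs)
        with open(self.csv_path, newline="") as f:
            self.urls = [row[0] for row in csv.reader(f) if row]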
Answered Oct 26 '25 by Granitosaurus


