I have a large file of relative URLs that I want to scrape with Scrapy, and I've written some code to read this file line by line and build requests for my spider to parse. Below is some sample code.
spider:
def start_requests(self):
    with open(self._file) as infile:
        for line in infile:
            inlist = line.replace("\n","").split(",")
            item = MyItem(data = inlist[0])
            request = scrapy.Request(
                url = "http://foo.org/{0}".format(item["data"]),
                callback = self.parse_some_page
            )
            request.meta["item"] = item  # attach the item so the callback can read it
            yield request
def parse_some_page(self,response):
    ...
    request = scrapy.Request(
        url = "http://foo.org/bar",
        callback = self.parse_some_page2
    )
    yield request
This works fine, but with a large input file, I'm seeing that parse_some_page2 isn't invoked until start_requests finishes yielding all the initial requests. Is there some way I can make Scrapy start invoking the callbacks earlier? Ultimately, I don't want to wait for a million requests before I start seeing items flow through the pipeline.
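For reference, the line-to-URL step can be tried on its own, outside Scrapy. This is a minimal sketch, assuming each line is comma-separated with the relative URL in the first column; the helper name iter_urls, its base parameter, and the in-memory sample are hypothetical and only there for illustration.

```python
import csv
import io
from urllib.parse import urljoin


def iter_urls(lines, base="http://foo.org/"):
    """Lazily turn comma-separated lines into absolute URLs.

    Hypothetical helper: `lines` is any iterable of text lines
    (an open file works), read one row at a time.
    """
    for row in csv.reader(lines):
        # Skip blank lines and rows whose first column is empty.
        if not row or not row[0].strip():
            continue
        yield urljoin(base, row[0].strip())


# Made-up sample input standing in for the real file.
sample = io.StringIO("a/b,extra\n\nc\n")
urls = list(iter_urls(sample))
```

Because iter_urls is a generator, it reads only as many lines as the consumer asks for, so the same loop could feed scrapy.Request objects from start_requests without loading the whole file first.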
I came up with 2 solutions. 1) Run spiders in separate processes if there are too many large sites. 2) Use deferreds and callbacks via Twisted (please don't run away, it won't be too scary). I'll discuss how to use the 2nd method because the first one can simply be googled.
As far as I can tell, every function that executes yield request will "block" until a result is available. So your parse_some_page() function yields a Scrapy Request object and will not go on to the next URL until a response is returned. I did manage to find some sites (mostly foreign government sites) that take a while to fetch, and hopefully they simulate the situation you're experiencing. Here is a quick and easy example:
# spider/stackoverflow_spider.py
from twisted.internet import defer
import scrapy
class StackOverflow(scrapy.Spider):
    name = 'stackoverflow'
    def start_requests(self):
        urls = [
            'http://www.gob.cl/en/',
            'http://www.thaigov.go.th/en.html',
            'https://www.yahoo.com',
            'https://www.stackoverflow.com',
            'https://swapi.co/',
        ]
        for index, url in enumerate(urls):
            # create callback chain after a response is returned
            deferred = defer.Deferred()
            deferred.addCallback(self.parse_some_page)
            deferred.addCallback(self.write_to_disk, url=url, filenumber=index+1)
            # add callbacks and errorbacks as needed
            yield scrapy.Request(
                url=url,
                callback=deferred.callback)     # this func will start the callback chain AFTER a response is returned
    def parse_some_page(self, response):
        print('[1] Parsing %s' % (response.url))
        return response.body    # this will be passed to the next callback
    def write_to_disk(self, content, url, filenumber):
        print('[2] Writing %s content to disk' % (url))
        filename = '%d.html' % filenumber
        with open(filename, 'wb') as f:
            f.write(content)
        # return what you want to pass to the next callback function
        # or raise an error and start Errbacks chain
I've changed things slightly to be a bit easier to read and run. The first thing to take note of in start_requests() is that Deferred objects are created and callback functions are being chained (via addCallback()) within the urls loop. Now take a look at the callback parameter for scrapy.Request:
yield scrapy.Request(
    url=url,
    callback=deferred.callback)
What this snippet does is start the callback chain as soon as the Response for that request becomes available. In Twisted, a Deferred starts running its callback chain only after Deferred.callback(result) is called with a value.
After a response is provided, the parse_some_page() function will run with the Response as an argument. Extract whatever you need there and return it so that it is passed to the next callback (i.e. write_to_disk() in my example). You can add more callbacks to the Deferred in the loop if necessary.
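If it helps to see that firing order in isolation, here is a minimal sketch that needs neither Scrapy nor Twisted. ToyDeferred is a hypothetical stand-in I made up for illustration: it mimics only addCallback() and callback(), not errbacks or anything else the real Deferred does. The simplified parse_some_page() and write_to_disk() below just record that they ran instead of touching the network or the disk.

```python
class ToyDeferred:
    """Toy stand-in for twisted.internet.defer.Deferred (illustration only)."""

    def __init__(self):
        self._callbacks = []   # pending (func, args, kwargs) triples
        self._fired = False
        self.result = None

    def addCallback(self, func, *args, **kwargs):
        # Queue the callback; if a result already exists, run it right away.
        self._callbacks.append((func, args, kwargs))
        if self._fired:
            self._run()
        return self

    def callback(self, result):
        # Fire the chain exactly once with an initial value.
        if self._fired:
            raise RuntimeError("already fired")  # Twisted raises its own error type
        self._fired = True
        self.result = result
        self._run()

    def _run(self):
        # Each callback receives the previous callback's return value.
        while self._callbacks:
            func, args, kwargs = self._callbacks.pop(0)
            self.result = func(self.result, *args, **kwargs)


log = []


def parse_some_page(body):
    log.append("parse")
    return body.upper()          # passed on to the next callback


def write_to_disk(content, url, filenumber):
    log.append("write %d" % filenumber)
    return "%s:%s" % (url, content)


d = ToyDeferred()
d.addCallback(parse_some_page)
d.addCallback(write_to_disk, url="http://foo.org/bar", filenumber=1)
nothing_ran_yet = (log == [])    # adding callbacks does not run them
d.callback("body")               # this is what starts the chain
```

Nothing in the chain runs while callbacks are being added; everything runs, in order, once callback() is handed a value, which is the role the response plays in the spider.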
So the difference between this answer and what you did originally is that you used yield to wait for all the responses first and then executed the callbacks, whereas my method uses Deferred.callback() as the callback for each request, so that each response is processed immediately.
Hopefully this helps (and/or works).
I have no clue if this will actually work for you since I couldn't find a site that is too large to parse. Also, I'm brand-spankin' new at Scrapy :D but I have years of Twisted under my belt.