How "download_slot" works within scrapy

I've created a script in Scrapy to parse the author name of different posts from its landing page and then pass it to the parse_page method using the meta keyword, in order to print the post content along with the author name at the same time.

I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary for the logic I tried to apply here, I would like to stick with it, only to understand how download_slot works within any script and why. I searched a lot to learn more about download_slot but ended up with links like this one.

An example usage of download_slot (I'm not quite sure about it though):

from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy

class ConventionSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        for link in response.css('.summary'):
            name = link.css('.user-details a::text').extract_first()
            url = link.css('.question-hyperlink::attr(href)').extract_first()
            nurl = response.urljoin(url)
            # Pass the author name to the next callback and also use it as the download slot
            yield Request(nurl, callback=self.parse_page, meta={'item': name, "download_slot": name})

    def parse_page(self, response):
        elem = response.meta.get("item")
        post = ' '.join(response.css("#question .post-text p::text").extract())
        yield {'Name': elem, 'Main_Content': post}

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    process.crawl(ConventionSpider)
    process.start()

The above script runs flawlessly.

My question: how does download_slot work within Scrapy?

MITHU asked Apr 26 '19


1 Answer

Let's start with the Scrapy architecture. When you create a scrapy.Request, the Scrapy engine passes the request to the downloader to fetch the content. The downloader puts incoming requests into slots which you can imagine as independent queues of requests. The queues are then polled and each individual request gets processed (the content gets downloaded).

Now, here's the crucial part. To determine which slot to put an incoming request into, the downloader checks request.meta for the download_slot key. If it's present, it puts the request into the slot with that name (creating it if it doesn't exist yet). If the download_slot key is not present, it puts the request into the slot for the domain (more accurately, the hostname) that the request's URL points to.
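
Here is a minimal, illustrative sketch of that slot-selection rule. The real logic lives inside Scrapy's downloader; the function name below is made up for illustration, not part of Scrapy's API:

from urllib.parse import urlparse
from scrapy import Request

def pick_slot_key(request):
    # An explicit "download_slot" in request.meta wins.
    if "download_slot" in request.meta:
        return request.meta["download_slot"]
    # Otherwise fall back to the hostname of the request URL.
    return urlparse(request.url).hostname or ""

# With the meta key, requests are grouped by author name:
print(pick_slot_key(Request("https://stackoverflow.com/q/1", meta={"download_slot": "MITHU"})))  # MITHU
# Without it, every request shares the stackoverflow.com slot:
print(pick_slot_key(Request("https://stackoverflow.com/q/2")))  # stackoverflow.com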

This explains why your script runs faster: because the slot name is taken from the author's name, you create multiple downloader slots. If you did not, all requests would be put into the same slot based on the domain (which is always stackoverflow.com). Thus, you effectively increase the parallelism of downloading content.
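
For reference, the parallelism each slot gets is still governed by Scrapy's normal concurrency settings: CONCURRENT_REQUESTS_PER_DOMAIN applies per slot, and CONCURRENT_REQUESTS caps the total across all slots. A sketch of how you might tune these in the question's script (the values are only examples, not recommendations):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # Global cap on in-flight requests across all slots (default 16).
    'CONCURRENT_REQUESTS': 32,
    # Per-slot cap; with one slot per author, each author's queue
    # can have up to this many requests in flight (default 8).
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
})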

This explanation is a little bit simplified but it should give you a picture of what's going on. You can check the code yourself.

Tomáš Linhart answered Oct 04 '22