Scrapy Clusters Distributed Crawl Strategy

Scrapy Clusters is awesome. It can be used to perform huge, continuous crawls using Redis and Kafka. It's really durable, but I'm still trying to figure out the finer details of the best logic for my specific needs.

Using Scrapy Clusters, I'm able to set up three levels of spiders that sequentially pass URLs to one another, like so:

site_url_crawler >>> gallery_url_crawler >>> content_crawler

(site_url_crawler would give something like cars.com/gallery/page:1 to gallery_url_crawler. gallery_url_crawler would then hand maybe 12 URLs to content_crawler, looking like cars.com/car:1234, cars.com/car:1235, cars.com/car:1236, etc. And content_crawler would gather the all-important data we want.)
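
For context, a crawl like this gets kicked off by feeding a JSON crawl request into the cluster's inbound Kafka topic (normally via the kafka_monitor). Here is a rough sketch of doing it by hand with kafka-python; the topic name and field values are assumptions from my setup, not gospel:

    import json
    from kafka import KafkaProducer  # kafka-python

    # Sketch only: the kafka_monitor usually builds and validates this message for you.
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    request = {
        "url": "http://cars.com/gallery/page:1",
        "appid": "testapp",              # assumed app id
        "crawlid": "site1",
        "spiderid": "site_url_crawler",  # which spider's queue this lands in
    }
    producer.send('demo.incoming', json.dumps(request).encode('utf-8'))  # inbound topic; adjust to your config
    producer.flush()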

I can do this by adding the following to gallery_url_crawler.py:

    req = scrapy.Request(url)

    # carry the existing meta over to the new request
    for key in response.meta.keys():
        req.meta[key] = response.meta[key]

    # route the request to the next spider in the chain
    req.meta['spiderid'] = 'content_crawler1'
    req.meta['crawlid'] = 'site1'

    yield req

With this strategy I can feed URLs from one crawler to the next without having to wait for the subsequent crawl to complete, which creates a queue. To fully utilize Clusters, I hope to add more crawlers wherever there is a bottleneck. In this workflow the bottleneck is at the end, when scraping the content. So I experimented with this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler + content_crawler + content_crawler

For lack of a better illustration, I'm just trying to show that I used three instances of that final spider to handle the longer queue.

But it seems that each instance of content_crawler waited patiently for the currently running content_crawler to finish. Hence, no boost in productivity.

A final idea I had was something like this:

site_url_crawler >>> gallery_url_crawler >>> content_crawler1 + content_crawler2 + content_crawler3

So I tried to use separate spiders to receive the final queue.
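
For reference, the receiving spiders themselves would differ only in their name attribute, since the spiderid in the request meta is matched against the spider's name. Below is a minimal sketch, assuming the spiders subclass Scrapy Cluster's RedisSpider base class the way the demo crawlers do (the import path and parse logic are placeholders; adjust to your project):

    from crawling.spiders.redis_spider import RedisSpider  # base class used by the demo crawlers

    class ContentCrawler1(RedisSpider):
        # 'spiderid' in the queued request meta must match this name
        name = "content_crawler1"

        def parse(self, response):
            # gather the all-important car data here (placeholder fields)
            yield {
                "url": response.url,
                "title": response.css("title::text").extract_first(),
            }

    class ContentCrawler2(RedisSpider):
        # identical logic, different name, so it reads from its own spider queue
        name = "content_crawler2"

        def parse(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").extract_first(),
            }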

Unfortunately, I could not experiment with this, since I could not pass the Kafka message to demo.inbound from gallery_url_crawler.py like so:

    req = scrapy.Request(url)

    for key in response.meta.keys():
        req.meta[key] = response.meta[key]

    req.meta['spiderid'] = 'content_crawler1'
    req.meta['spiderid'] = 'content_crawler2'  # overwrites the line above; meta is just a dict
    req.meta['crawlid'] = 'site1'

    yield req

(Notice the extra spiderid.) The above did not work, I think because a single message cannot be assigned to two different spiders: the second spiderid assignment simply overwrites the first. And this:

    # two otherwise identical requests, one per content spider
    req1 = scrapy.Request(url)
    req2 = scrapy.Request(url)

    for key in response.meta.keys():
        req1.meta[key] = response.meta[key]
        req2.meta[key] = response.meta[key]

    req1.meta['spiderid'] = 'content_crawler1'
    req1.meta['crawlid'] = 'site1'

    req2.meta['spiderid'] = 'content_crawler2'
    req2.meta['crawlid'] = 'site1'

    yield req1
    yield req2

This did not work either, I think because the dupefilter kicked out the second request, seeing it as a dupe of the first.
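
If that is what's happening, one possible (untested) workaround is Scrapy's standard dont_filter flag on the second request, which tells the stock dupefilter to let it through; whether Scrapy Cluster's Redis-based scheduler honors that flag is something I'd still have to verify. A sketch:

    # Sketch only: dont_filter=True is plain Scrapy behavior; check that the
    # cluster's distributed dupefilter respects it in your version.
    req2 = scrapy.Request(url, dont_filter=True)
    for key in response.meta.keys():
        req2.meta[key] = response.meta[key]
    req2.meta['spiderid'] = 'content_crawler2'
    req2.meta['crawlid'] = 'site1'
    yield req2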

Anyway, I ultimately just hope to use Clusters in a way that lets me fire up multiple spider instances at any time, have them pull from the queue, and repeat.

1 Answer

It turns out that the distribution of URLs is based on IP addresses. Once I stood up the cluster on separate machines, i.e. a different machine for each spider, the URLs flowed and all spiders were pulling from the queue.

http://scrapy-cluster.readthedocs.org/en/latest/topics/crawler/controlling.html

Scrapy Cluster comes with two major strategies for controlling how fast your pool of spiders hit different domains. This is determined by spider type and/or IP Address, but both act upon the different Domain Queues.
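
For anyone tuning this, the throttling the docs describe is driven by settings in the crawler's localsettings.py. Here is a sketch with the setting names as I understand them from that page (double-check them against your Scrapy Cluster version):

    # localsettings.py (crawler side) -- sketch only; verify names against the docs
    QUEUE_HITS = 10                 # allowed requests per domain...
    QUEUE_WINDOW = 60               # ...per rolling window, in seconds
    SCHEDULER_TYPE_ENABLED = True   # include the spider type in the throttle key
    SCHEDULER_IP_ENABLED = True     # include the machine's IP in the throttle key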
