Scrapy Cluster is awesome. It can be used to perform huge, continuous crawls using Redis and Kafka. It's really durable, but I'm still trying to figure out the finer details of the best logic for my specific needs.
Using Scrapy Cluster I'm able to set up three levels of spiders that sequentially receive urls from one another, like so:
site_url_crawler >>> gallery_url_crawler >>> content_crawler
(site_url_crawler would give something like cars.com/gallery/page:1 to gallery_url_crawler. gallery_url_crawler would then give maybe 12 urls to content_crawler that might look like cars.com/car:1234, cars.com/car:1235, cars.com/car:1236, etc. And content_crawler would gather the all-important data we want.)
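For context, the first url in that chain enters the cluster through the Kafka monitor. Here is a rough sketch of how that might look, assuming the default demo.incoming topic and the standard url/appid/crawlid/spiderid request fields; all the values below are only illustrative, and in practice the bundled kafka_monitor.py feed utility does the same job:

import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
seed = {
    "url": "http://cars.com",            # hypothetical seed url
    "appid": "testapp",
    "crawlid": "site1",
    "spiderid": "site_url_crawler",      # hand the seed to the first spider in the chain
}
producer.send('demo.incoming', json.dumps(seed).encode('utf-8'))
producer.flush()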
I can do this by adding the following to gallery_url_crawler.py:
req = scrapy.Request(url)
for key in response.meta.keys():
    req.meta[key] = response.meta[key]      # carry the existing meta forward
req.meta['spiderid'] = 'content_crawler1'   # route this request to the content spider
req.meta['crawlid'] = 'site1'
yield req
With this strategy I can feed urls from one crawler to another without having to wait for the subsequent crawl to complete. This then creates a queue. To fully utilize the cluster I hope to add more crawlers wherever there is a bottleneck. In this workflow the bottleneck is at the end, when scraping the content. So I experimented with this:
site_url_crawler >>> gallery_url_crawler >>> content_crawler + content_crawler + content_crawler
For lack of a better illustration, I'm just trying to show that I ran three instances of that final spider to handle the longer queue.
BUT it seems that each instance of content_crawler waited patiently for the currently running content_crawler to finish. Hence, no boost in productivity.
A final idea I had was something like this:
site_url_crawler >>> gallery_url_crawler >>> content_crawler1 + content_crawler2 + content_crawler3
So I tried to use separate spiders to receive the final queue.
Unfortunately, I could not experiment with this, since I could not pass the kafka message to demo.inbound from gallery_url_crawler.py like so:
req = scrapy.Request(url)
for key in response.meta.keys():
    req.meta[key] = response.meta[key]
req.meta['spiderid'] = 'content_crawler1'
req.meta['spiderid'] = 'content_crawler2'   # overwrites the line above
req.meta['crawlid'] = 'site1'
yield req
(Notice the extra spiderid.) The above did not work; I think a single message can't be assigned to two different spiders, and since meta is just a dict, the second spiderid assignment simply overwrites the first anyway. And this:
req1 = scrapy.Request(url)
req2 = scrapy.Request(url)
for key in response.meta.keys():
    req1.meta[key] = response.meta[key]
req1.meta['spiderid'] = 'content_crawler1'
req1.meta['crawlid'] = 'site1'
for key2 in response.meta.keys():
    req2.meta[key2] = response.meta[key2]
req2.meta['spiderid'] = 'content_crawler2'
req2.meta['crawlid'] = 'site1'
yield req1
yield req2
That did not work either, I think because the dupefilter kicked out the second request, seeing the same url as a dupe.
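A possible workaround (just an untested sketch) might be to alternate the spiderid per url instead of duplicating the request, so every url goes to exactly one content crawler and the dupefilter never sees the same url twice:

from itertools import cycle

import scrapy

# Hypothetical helper using the spider names from above; cycle() hands out
# the three content crawlers in round-robin order.
content_spiders = cycle(['content_crawler1', 'content_crawler2', 'content_crawler3'])

def route_to_content_crawlers(response, urls):
    # Yield one request per url, alternating which content crawler handles it.
    for url in urls:
        req = scrapy.Request(url)
        for key in response.meta.keys():
            req.meta[key] = response.meta[key]
        req.meta['spiderid'] = next(content_spiders)   # each url targets exactly one spider
        req.meta['crawlid'] = 'site1'
        yield req

gallery_url_crawler's parse callback could then do yield from route_to_content_crawlers(response, car_urls) with whatever list of car urls it extracted.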
Anyway, I just hope to ultimately use the cluster in a way that lets me fire up instances of multiple spiders at any time, have them pull from the queue, and repeat.
It turns out that distribution of the urls is based on IP address. Once I stood up the cluster on separate machines, i.e. a different machine for each spider, the urls flowed and all of the spiders were pulling from the queue.
http://scrapy-cluster.readthedocs.org/en/latest/topics/crawler/controlling.html
Scrapy Cluster comes with two major strategies for controlling how fast your pool of spiders hit different domains. This is determined by spider type and/or IP Address, but both act upon the different Domain Queues.
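For reference, the throttle style is chosen in the crawler settings. The setting names below are taken from that docs page and may differ between versions, so treat this as a sketch:

# crawling/localsettings.py (names from the Scrapy Cluster docs linked above;
# double-check them against your version)

# Factor the spider's public IP address into the domain throttle queues, so
# spiders running on different machines pull from separate throttle slots.
SCHEDULER_IP_ENABLED = True

# Also factor in the spider type, so e.g. gallery_url_crawler and
# content_crawler hitting the same domain are throttled separately.
SCHEDULER_TYPE_ENABLED = True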