
How can I make start_url in scrapy to consume from a message queue?

Tags: python, scrapy

I am building a Scrapy project in which I have multiple spiders (a spider for each domain). The URLs to be scraped come dynamically from a user-given query, so I do not need to do broad crawls or even follow links; URLs will come in one after the other and I just need to extract the data using selectors. So I was thinking that if I could pass the URLs onto a message queue which the Scrapy spider could consume from, I'd be fine. But I am not able to figure it out. I have checked

https://github.com/darkrho/scrapy-redis

but I feel it's not suitable for my purposes, as I need multiple queues (a single queue for each spider). From what I have learned, one way seems to be to override the start_requests method in the spider. But here again I am not clear on what to do (I'm new to Python and Scrapy). Could I just treat it as any normal Python script and override the method to use any message queue? Also, I need the spider(s) running 24*7, scraping whenever there is a request on the queue. I figured I should use signals and raise the DontCloseSpider exception somewhere, but where do I do that? I am pretty lost. Please help.

Here's the scenario I am looking at:

User -> Query -> url from abc.com -> abc-spider

              -> url from xyz.com -> xyz-spider

              -> url from ghi.com -> ghi-spider

Now each URL has the same things to be scraped for each website, so I have selectors doing that in each spider. That is just the single-user scenario, though. When there are multiple users, there will be multiple unrelated URLs coming in for the same spider, so it will be something like this:

query1,query2, query3

abc.com -> url_abc1,url_abc2,url_abc3

xyz.com -> url_xyz1,url_xyz2,url_xyz3

ghi.com -> url_ghi1,url_ghi2, url_ghi3

So for each website, these URLs will arrive dynamically and be pushed onto their respective message queues. Each spider meant for a website must then consume its own queue and give me the scraped items whenever there is a request on that message queue.

Asked Sep 22 '14 by Avinragh

1 Answer

This is a very common and (IMO) excellent way to use Scrapy as part of a data pipeline; I do it all the time.

You are correct that you want to override the spider's start_requests() method. If you define start_requests(), Scrapy uses it instead of the default handling of the start_urls variable, so I'd recommend just using start_requests() if you're consuming from a dynamic source like a message queue or database.

Here's some example code; it's untested but should give you the right idea. It assumes your queue is populated by another process. Please let me know if you need more information.

import scrapy


class ProfileSpider(scrapy.Spider):
    name = 'myspider'

    # self.queue is assumed to be set up elsewhere (e.g. in __init__)
    # and to expose a read() method that returns a URL string, or None
    # when the queue is currently empty.

    def start_requests(self):
        for url in self._pop_queue():
            if url:
                yield self.make_requests_from_url(url)

    def _pop_queue(self):
        # Endless generator that polls the queue for new URLs.
        while True:
            yield self.queue.read()

This exposes your queue as a generator. If you want to minimize the amount of empty looping (because the queue could be empty a lot of the time), you can add a sleep or exponential backoff in the _pop_queue loop: if the queue is empty, sleep for a few seconds and try to pop again.
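For illustration, here is one possible shape for that backoff. This is only a sketch, and it assumes the same hypothetical self.queue.read() interface that returns None while the queue is empty:

    # Drop-in replacement for _pop_queue() in the spider above;
    # `import time` is needed at the top of the module.
    def _pop_queue(self):
        delay = 1
        while True:
            url = self.queue.read()  # assumed interface: None when empty
            if url:
                delay = 1  # reset the backoff once work arrives
                yield url
            else:
                time.sleep(delay)
                delay = min(delay * 2, 60)  # exponential backoff, capped at 60s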

Assuming no fatal errors happen in your code, I believe this shouldn't terminate, because of the way the loops and generators are constructed.
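Separately, since the question asks where the DontCloseSpider exception comes in: another pattern I've seen (a rough sketch, not tested against your setup, and again assuming a hypothetical self.queue.read() that returns None when empty) is to hook the spider_idle signal, schedule any newly queued URLs from the handler, and raise DontCloseSpider so the spider stays up 24*7:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class ProfileSpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ProfileSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Call check_queue() whenever the spider runs out of pending requests.
        crawler.signals.connect(spider.check_queue, signal=signals.spider_idle)
        return spider

    def check_queue(self):
        url = self.queue.read()  # assumed queue interface, as above
        if url:
            # The exact engine.crawl() signature varies between Scrapy versions;
            # older releases also take the spider as a second argument.
            self.crawler.engine.crawl(scrapy.Request(url, callback=self.parse))
        # Keep the spider alive even though it has nothing to do right now.
        raise DontCloseSpider

    def parse(self, response):
        # your per-site selectors go here
        pass

In my experience the idle signal fires again periodically while the spider sits idle, so the queue gets re-checked without a busy loop.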

Answered Oct 13 '22 by Travis Leleu