Returning Items in scrapy's start_requests()

Tags: python, scrapy

I am writing a scrapy spider that takes as input many urls and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which Scrapy forbids. How can I circumvent this?

I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.

asked Oct 19 '22 by pintoch

2 Answers

You could use a downloader middleware to do this job.

In start_requests(), you should still make a request for every URL, for example:

def start_requests(self):
    for url in all_urls:
        yield scrapy.Request(url)

Then write a downloader middleware that intercepts the special URLs:

from scrapy.http import Response

class DirectReturn:
    def process_request(self, request, spider):
        # direct_return_url_set is assumed to hold the URLs that can be
        # classified without downloading them
        if request.url in direct_return_url_set:
            request.meta['direct_return_url'] = True
            # Returning a Response here short-circuits the download
            return Response(url=request.url, request=request)
        # Returning None lets Scrapy download the request as usual
        return None
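For the middleware to run, it also has to be enabled in the project settings; the module path and priority value below are placeholders:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.DirectReturn': 543,
}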

Then, in your parse method, just check whether the key direct_return_url is in response.meta. If it is, generate an item, put response.url into it, and yield the item.
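A minimal sketch of what that parse method could look like; UrlItem and the spider name are hypothetical, not part of the original answer:

import scrapy

class UrlItem(scrapy.Item):
    # hypothetical item with a single field for the classified URL
    url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = 'classifier'  # hypothetical spider name

    def parse(self, response):
        if response.meta.get('direct_return_url'):
            # Response was faked by the middleware; only its URL matters
            yield UrlItem(url=response.url)
        else:
            ...  # classify the downloaded page as usual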

answered Nov 03 '22 by Kingname


I think using a spider middleware and overriding start_requests() would be a good start.

In your middleware, loop over all URLs in start_urls, and use conditional statements to deal with the different types of URLs.

  • For your special URLs, which do not require a request, you can
    • directly call your pipeline's process_item(); do not forget to import your pipeline and create a scrapy.Item from your URL for this (a sketch follows this list)
    • as you mentioned, pass the URL as meta in a Request, and have a separate parse function that would only return the URL
  • For all remaining URLs, you can launch a "normal" Request as you probably already have defined
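A minimal sketch of the first option, assuming a spider middleware that implements process_start_requests(); the names ClassifyStartUrlsMiddleware, MyClassificationPipeline, UrlItem, and is_classifiable_offline() are hypothetical placeholders:

from myproject.items import UrlItem                       # hypothetical item
from myproject.pipelines import MyClassificationPipeline  # hypothetical pipeline

def is_classifiable_offline(url):
    # hypothetical predicate: decides from the URL string alone;
    # the condition below is just an example
    return url.endswith('.pdf')

class ClassifyStartUrlsMiddleware:
    def __init__(self):
        self.pipeline = MyClassificationPipeline()

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            if is_classifiable_offline(request.url):
                # Bypass the download entirely: hand an item straight
                # to the pipeline (skips Scrapy's normal item chain)
                self.pipeline.process_item(UrlItem(url=request.url), spider)
            else:
                yield request

Note that this middleware would need to be registered under SPIDER_MIDDLEWARES in settings.py, and that calling process_item() directly bypasses the engine's item handling, so pipelines relying on open_spider()/close_spider() hooks would need extra care.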
answered Nov 03 '22 by Ruehri