
How to add Headers to Scrapy CrawlSpider Requests?

Tags:

python

scrapy

I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.

As per this question, I checked

response.request.headers.get('Referer', None)

in my response parsing function and the Referer header is not present. I assume that means the Referer is not being submitted in the request (unless the website doesn't return it, I'm not sure on that).

I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.

Does anyone know how to modify request headers dynamically?

asked Jan 08 '13 by CatShoes



2 Answers

I hate to answer my own question, but I found out how to do it. You have to enable the spider middleware that populates the Referer header on responses. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware.

In short, you need to add this middleware to your project's settings file.

SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
}

Then in your response parsing method you can use response.request.headers.get('Referer', None) to get the referer.

If you think you understand these middlewares right away, read the docs again, take a break, and then read them again. I found them to be very confusing.

answered Sep 27 '22 by CatShoes

You can pass the Referer manually to each request using the headers argument:

yield Request(url, callback=..., headers={'Referer': ...})

RefererMiddleware does the same thing automatically, taking the referrer URL from the previous response.

answered Sep 27 '22 by warvariuc