 

Scrapy: how to set the referer URL

I need to set the referer URL before scraping a site. The site uses referer-based authentication, so it does not let me log in if the referer is not valid.

Could someone tell me how to do this in Scrapy?

asked Oct 25 '12 by vumaasha


People also ask

What is the referer URL?

The address of the webpage where a person clicked a link that sent them to your page. The referrer is the webpage that sends visitors to your site using a link. In other words, it's the webpage that a person was on right before they landed on your page.

Can http referer be changed?

You can change the value of the referrer in the HTTP header using the Web Request API. It requires a background JS script for its use. You can use onBeforeSendHeaders, as it modifies the headers before the request is sent.

What is cb_ kwargs?

cb_kwargs. A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.
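
For illustration, here is a minimal sketch of cb_kwargs in use (requires Scrapy 1.7+; the URL and the page_type key are made up for the example):

import scrapy

class DetailSpider(scrapy.Spider):
    name = "detail_spider"  # illustrative name
    start_urls = ["http://example.com/items"]

    def parse(self, response):
        # The cb_kwargs contents become keyword arguments of the callback
        yield scrapy.Request(
            "http://example.com/items/1",       # illustrative URL
            callback=self.parse_item,
            cb_kwargs={"page_type": "detail"},  # made-up key for the example
        )

    def parse_item(self, response, page_type):
        # page_type arrives here alongside the response
        self.logger.info("Parsed a %s page: %s", page_type, response.url)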

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
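
A minimal sketch of that pattern (the spider name and URL are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # illustrative name

    def start_requests(self):
        # A request needs the URL to fetch plus a callback to handle the response
        yield scrapy.Request(url="http://example.com/", callback=self.parse)

    def parse(self, response):
        # Invoked once the response for the request arrives
        self.logger.info("Got status %s for %s", response.status, response.url)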


2 Answers

If you want to change the referer in your spider's requests, you can set DEFAULT_REQUEST_HEADERS in the settings.py file:

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://www.google.com' 
}
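
Note that this applies the header to every request in the project. If you only want it for one spider, a hedged alternative is the spider's custom_settings attribute (the spider name and URL below are illustrative):

import scrapy

class RefererSpider(scrapy.Spider):
    name = "referer_spider"  # illustrative name
    start_urls = ["http://example.com/"]  # illustrative URL

    # Overrides the project-wide settings.py value for this spider only
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "Referer": "http://www.google.com",
        }
    }

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)
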
answered by Cristóbal Morales


You should do exactly as @warwaruk indicated; below is my elaboration of that approach for a crawl spider:

from scrapy.spiders import CrawlSpider
from scrapy import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        # Build the initial requests by hand so each one carries a Referer header
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)

This should produce log lines like the following in your terminal:

(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)

The same will work with BaseSpider; in the end, start_requests is a BaseSpider method, which CrawlSpider inherits.

The documentation explains more options that can be set on a Request apart from headers, such as cookies, the callback function, the priority of the request, etc.
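
As a rough sketch of those extra options on a single Request (all values here are illustrative, not from the original answer):

import scrapy

class AccountSpider(scrapy.Spider):
    name = "account_spider"  # illustrative name

    def start_requests(self):
        yield scrapy.Request(
            url="http://example.com/account",                # illustrative URL
            headers={"Referer": "http://www.example.com/"},
            cookies={"sessionid": "example-session-id"},     # made-up cookie
            callback=self.parse_account,
            priority=10,  # higher-priority requests are scheduled earlier
        )

    def parse_account(self, response):
        self.logger.info("Got %s", response.url)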

answered by Kulbi