I need to set the Referer URL before scraping a site. The site uses Referer-based authentication, so it does not let me log in if the Referer is not valid.
Could someone tell me how to do this in Scrapy?
The referrer is the address of the webpage a person was on right before they landed on your page; in other words, it is the page that sent the visitor to your site via a link.
In a browser extension, you can change the value of the referrer in the HTTP header using the Web Request API. It requires a background JS script for its use. You can use onBeforeSendHeaders, since it modifies the headers before the request is sent.
cb_kwargs: a dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means that by default callbacks only get a Response object as an argument.
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
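As a concrete illustration, here is a minimal sketch of a request with a callback and cb_kwargs; the spider name, URL, and callback name are placeholders, not taken from the answers below:

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        # cb_kwargs entries are passed to the callback as keyword arguments
        yield scrapy.Request(
            url="http://example.com/",
            callback=self.parse_page,
            cb_kwargs={"page_label": "landing"},
        )

    def parse_page(self, response, page_label):
        # receives the Response plus the extra keyword argument
        self.logger.info("parsed %s (label=%s)", response.url, page_label)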
If you want to change the Referer for all of your spider's requests, you can set DEFAULT_REQUEST_HEADERS in the settings.py file:
DEFAULT_REQUEST_HEADERS = {
'Referer': 'http://www.google.com'
}
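Note that DEFAULT_REQUEST_HEADERS applies project-wide. If you want the header for a single spider only, Scrapy also lets you override settings per spider via the custom_settings class attribute; a minimal sketch, where the spider name and start URL are placeholders:

import scrapy

class RefererSpider(scrapy.Spider):
    name = "referer_demo"
    start_urls = ["http://example.com/"]
    # overrides the project-wide setting for this spider only
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "Referer": "http://www.google.com",
        },
    }

    def parse(self, response):
        pass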
You should do exactly as @warwaruk indicated; below is my elaboration of it for a crawl spider:
from scrapy.spiders import CrawlSpider
from scrapy import Request

class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://example.com/foo',
        'http://example.com/bar',
        'http://example.com/baz',
    ]
    rules = [(...)]

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            # attach the Referer header to every start request
            requests.append(Request(url=item, headers={'Referer': 'http://www.example.com/'}))
        return requests

    def parse_me(self, response):
        (...)
This should generate the following log lines in your terminal:
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/foo> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/bar> (referer: http://www.example.com/)
(...)
[myspider] DEBUG: Crawled (200) <GET http://example.com/baz> (referer: http://www.example.com/)
(...)
The same will work with BaseSpider; in the end, start_requests is a BaseSpider method, which CrawlSpider inherits.
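For example, a minimal sketch of the same override on a plain spider (scrapy.Spider, the modern name for BaseSpider); the spider name and URL are illustrative:

import scrapy

class PlainSpider(scrapy.Spider):
    name = "plain_referer"

    def start_requests(self):
        # same Referer override as in the CrawlSpider example above
        yield scrapy.Request(
            url="http://example.com/foo",
            headers={"Referer": "http://www.example.com/"},
        )

    def parse(self, response):
        self.logger.info("crawled %s", response.url)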
The documentation explains more options that can be set on a Request apart from headers, such as cookies, a callback function, the priority of the request, etc.
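A hedged sketch combining several of those options on one Request; all names and values here are placeholders:

import scrapy

class OptionsSpider(scrapy.Spider):
    name = "options_demo"

    def start_requests(self):
        yield scrapy.Request(
            url="http://example.com/account",
            callback=self.parse_account,
            headers={"Referer": "http://www.example.com/"},
            cookies={"session": "placeholder-value"},  # placeholder session cookie
            priority=10,  # higher-priority requests are scheduled earlier
        )

    def parse_account(self, response):
        pass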