
how to overwrite / use cookies in scrapy

Tags:

python

scrapy

I want to scrape http://www.3andena.com/. This website starts in Arabic by default and stores the language setting in cookies. If you try to access the English version directly through the URL (http://www.3andena.com/home.php?sl=en), it causes a problem and returns a server error.

So, I want to set the cookie value "store_language" to "en" and then start scraping the website using this cookie value.

I'm using CrawlSpider with a couple of Rules.

Here's the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
  name = "andena"
  domain_name = "3andena.com"
  start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

  product_urls = []

  rules = (
     # The following rule is for pagination
     Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
     # The following rule is for product details
     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
     )

  def start_requests(self):
    yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

    for url in self.start_urls:
        yield Request(url, callback=self.parse_category)


  def parse_category(self, response):
    hxs = HtmlXPathSelector(response)

    self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

    for product in self.product_urls:
        yield Request(product, callback=self.parse_product)  


  def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = Product()

    '''
    some parsing
    '''

    items.append(item)
    return items

SPIDER = AndenaSpider()

Here's the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
Asked May 19 '12 by Mahmoud M. Abdel-Fattah


3 Answers

Modify your code as below:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language':'en'}, callback=self.parse_category)

The scrapy.Request object accepts an optional cookies keyword argument; see the Requests and Responses documentation.
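
If you want to double-check that the cookie is really attached to the outgoing requests, Scrapy's built-in cookies middleware can log cookie traffic. A minimal sketch, assuming you can edit the project's settings.py (COOKIES_ENABLED and COOKIES_DEBUG are the middleware's standard settings):

# settings.py -- sketch: make the built-in CookiesMiddleware log the cookies
# it sends and receives, so you can confirm store_language=en goes out with
# every request.
COOKIES_ENABLED = True   # the default, shown only for clarity
COOKIES_DEBUG = True     # logs lines like "Sending cookies to: <GET ...>"

With COOKIES_DEBUG on, the DEBUG log should show the Cookie header for each request, next to the redirect lines you already see.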

Answered Nov 09 '22 by iefreer


This is how I do it as of Scrapy 0.24.6:

from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request

Scrapy calls make_requests_from_url with the URLs in the spider's start_urls attribute. The code above lets the default implementation create the request and then adds a foo cookie with the value bar (or changes it to bar if, against all odds, the request produced by the default implementation already carries a foo cookie).

In case you wonder what happens to requests that are not created from start_urls: Scrapy's cookie middleware remembers the cookie set by the code above and sets it on all future requests to the same domain as the request on which you explicitly added the cookie.
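
On more recent Scrapy releases, where make_requests_from_url has been deprecated, the same effect can be achieved by overriding start_requests directly. A minimal sketch with illustrative spider, URL and cookie names (none of them taken from the answer above):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"                        # illustrative name
    start_urls = ["http://www.example.com/"]  # illustrative URL
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def start_requests(self):
        for url in self.start_urls:
            # Leave the callback unset so CrawlSpider's own parse logic
            # still applies the rules; just attach the cookie here.
            yield scrapy.Request(url, cookies={"foo": "bar"})

    def parse_item(self, response):
        yield {"url": response.url}

As with make_requests_from_url, the cookies middleware will keep sending the cookie on follow-up requests to the same domain.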

Answered Nov 09 '22 by Louis


Straight from the Scrapy documentation for Requests and Responses.

You'll need something like this:

request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'})
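
The same documentation also describes a second form of the cookies argument, a list of dicts, which lets you pin the cookie to a specific domain and path. A sketch using the question's cookie (the domain and path values here are illustrative):

from scrapy.http import Request

# Sketch: cookies passed as a list of dicts, per the Requests and Responses docs.
request_with_cookies = Request(
    url="http://www.3andena.com",
    cookies=[{
        'name': 'store_language',
        'value': 'en',
        'domain': '3andena.com',  # illustrative; match the site's real cookie domain
        'path': '/',
    }],
)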
Answered Nov 09 '22 by VenkatH