I want to scrape http://www.3andena.com/. This website starts in Arabic by default, and it stores the language setting in cookies. If you try to access the English version directly through the URL (http://www.3andena.com/home.php?sl=en), it causes a problem and returns a server error.
So I want to set the cookie "store_language" to "en", then start scraping the website with that cookie value.
I'm using CrawlSpider with a couple of Rules.
Here's the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re


class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),

        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language': 'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()
        '''
        some parsing
        '''
        items.append(item)
        return items

SPIDER = AndenaSpider()
Here's the log:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
Modify your code as below:
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language': 'en'}, callback=self.parse_category)
The scrapy.Request object accepts an optional cookies keyword argument; see the documentation here.
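For reference (this comes from the Scrapy docs, not the answer above): the cookies argument takes either a plain dict of name/value pairs or a list of dicts that can also pin the cookie's domain and path. A minimal sketch using the store_language cookie from the question; the domain and path values here are assumptions:

from scrapy.http import Request

# Form 1: a plain dict of name/value pairs
req_dict = Request('http://www.3andena.com/Kettles/?objects_per_page=10',
                   cookies={'store_language': 'en'})

# Form 2: a list of dicts, which also lets you specify domain and path
req_list = Request('http://www.3andena.com/Kettles/?objects_per_page=10',
                   cookies=[{'name': 'store_language',
                             'value': 'en',
                             'domain': '3andena.com',
                             'path': '/'}])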
This is how I do it as of Scrapy 0.24.6:
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request
Scrapy calls make_requests_from_url with the URLs in the start_urls attribute of the spider. What the code above does is let the default implementation create the request and then add a foo cookie with the value bar. (Or change the cookie to the value bar if it so happens, against all odds, that there is already a foo cookie on the request produced by the default implementation.)
In case you wonder what happens with requests that are not created from start_urls, let me add that Scrapy's cookie middleware will remember the cookie set with the code above and set it on all future requests that share the same domain as the request on which you explicitly added your cookie.
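If you want to watch that happen, one option (my own suggestion, not part of the answer above) is to turn on Scrapy's cookie debugging in settings.py, which logs the Cookie and Set-Cookie headers for every request and response:

# settings.py -- log cookie headers so you can see the cookies middleware
# attach the store_language cookie to later requests on the same domain
COOKIES_ENABLED = True   # the default; keeps the cookies middleware active
COOKIES_DEBUG = True     # log Cookie / Set-Cookie headers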
Straight from the Scrapy documentation for Requests and Responses.
You'll need something like this:
request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'})
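Hooked into the spider from the question, that could look like the following sketch (essentially the same idea as the first answer: yield the start URLs yourself with the cookie attached, keeping the parse_category callback from the question's spider):

def start_requests(self):
    # attach the language cookie to each start URL; parse_category is
    # the callback defined in the spider from the question
    for url in self.start_urls:
        yield Request(url, cookies={'store_language': 'en'}, callback=self.parse_category)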