I am new to Scrapy and decided to try it out because of good online reviews. I am trying to log in to a website with Scrapy. I have successfully logged in with a combination of Selenium and Mechanize by collecting the needed cookies with Selenium and adding them to Mechanize. Now I am trying to do something similar with Scrapy and Selenium, but I can't seem to get anything to work. I can't really even tell if anything is working or not. Can anyone please help me? Below is what I've started on. I may not even need to transfer the cookies with Scrapy, but I can't tell if it ever actually logs in or not. Thanks!
from scrapy.spider import BaseSpider
from scrapy.http import Response, FormRequest, Request
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver

class MySpider(BaseSpider):
    name = 'MySpider'
    start_urls = ['http://my_domain.com/']

    def get_cookies(self):
        driver = webdriver.Firefox()
        driver.implicitly_wait(30)
        base_url = "http://www.my_domain.com/"
        driver.get(base_url)
        driver.find_element_by_name("USER").clear()
        driver.find_element_by_name("USER").send_keys("my_username")
        driver.find_element_by_name("PASSWORD").clear()
        driver.find_element_by_name("PASSWORD").send_keys("my_password")
        driver.find_element_by_name("submit").click()
        cookies = driver.get_cookies()
        driver.close()
        return cookies

    def parse(self, response, my_cookies=get_cookies):
        return Request(url="http://my_domain.com/",
                       cookies=my_cookies,
                       callback=self.login)

    def login(self, response):
        return [FormRequest.from_response(response,
                                          formname='login_form',
                                          formdata={'USER': 'my_username', 'PASSWORD': 'my_password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        hxs = HtmlXPathSelector(response)
        print hxs.select('/html/head/title').extract()
Using Scrapy to handle token-based authentication: open your browser's developer tools and switch to the network tab before logging in, then perform the login manually. All the requests will appear in the panel; selecting the login request on the left-hand side lets you inspect its request headers, including any token the form submits.
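In Scrapy terms, a rough sketch of that flow might look like the following (parse_login_page, login_form, and csrf_token are placeholder names; substitute whatever your network tab actually shows):

from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector

def parse_login_page(self, response):
    # Hedged sketch: read the hidden token off the login page and send it
    # back along with the credentials. 'csrf_token' is an assumed field
    # name, not something from the question.
    hxs = HtmlXPathSelector(response)
    token = hxs.select("//input[@name='csrf_token']/@value").extract()
    return FormRequest.from_response(
        response,
        formname='login_form',
        formdata={'USER': 'my_username',
                  'PASSWORD': 'my_password',
                  'csrf_token': token[0] if token else ''},
        callback=self.after_login)

Note that FormRequest.from_response already copies hidden form fields (tokens included) into the submitted data, so the explicit extraction above is only needed when the token lives outside the form itself.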
Selenium is an excellent automation tool, and Scrapy is by far the most robust web scraping framework. For web scraping, Scrapy is the better choice in terms of speed and efficiency; for JavaScript-heavy websites where you need to drive AJAX/PJAX requests, Selenium can work better.
Scrapy has a downloader middleware, CookiesMiddleware, implemented to support cookies; you just need to enable it (it is on by default). It mimics how a browser's cookie jar works.
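For example, in settings.py (COOKIES_ENABLED is already True by default; COOKIES_DEBUG is just a debugging aid):

# settings.py
COOKIES_ENABLED = True   # default: CookiesMiddleware keeps a cookie jar for the crawl
COOKIES_DEBUG = True     # log all Cookie / Set-Cookie headers, handy when debugging a login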
Your question is more of a debugging issue, so my answer will just have some notes on your question, not an exact answer.
def parse(self, response, my_cookies=get_cookies):
    return Request(url="http://my_domain.com/",
                   cookies=my_cookies,
                   callback=self.login)
my_cookies=get_cookies - here you are assigning the function itself as a default argument, not the result it returns. I think you don't need to pass any function here as a parameter at all. It should be:
def parse(self, response):
    return Request(url="http://my_domain.com/",
                   cookies=self.get_cookies(),
                   callback=self.login)
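A further sketch, beyond the original notes: since parse is itself a response callback (Scrapy has already fetched a page by the time it runs), you could instead do the Selenium login once in start_requests so the cookies ride along on the very first request:

def start_requests(self):
    # Hedged alternative: run the Selenium login before any Scrapy
    # request goes out, then attach the harvested cookies up front.
    yield Request(url="http://my_domain.com/",
                  cookies=self.get_cookies(),
                  callback=self.login)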
The cookies argument for Request should be a dict - please verify it is indeed a dict.
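Note that Selenium's get_cookies() returns a list of dicts (each with 'name', 'value', 'domain', and so on), not a dict, so a small conversion along these lines is probably needed:

def get_cookies(self):
    # ... Selenium login steps as in the question ...
    cookies = driver.get_cookies()   # a list of dicts, one per cookie
    driver.close()
    # flatten to the simple {name: value} mapping expected by Request
    return dict((c['name'], c['value']) for c in cookies)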
"I can't really even tell if anything is working or not."
Put some print statements in the callbacks to follow the execution.
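For example (self.log is the spider's built-in logging helper):

def after_login(self, response):
    # quick sanity check: where did we land, and what is the page title?
    self.log("after_login: status=%s url=%s" % (response.status, response.url))
    hxs = HtmlXPathSelector(response)
    self.log("title: %s" % hxs.select('/html/head/title/text()').extract())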