In my previous question, I wasn't very specific about my problem (scraping with an authenticated session in Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably have used the word crawling instead.
So, here is my code so far:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i['url'] = response.url  # item creation elided in this snippet
        # ... do more things
        return i
As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.
The problem is that the parse function I tried to override in order to log in now no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.
Has anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider.) Any help would be appreciated.
In order for the above solution to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider, by changing the following in the Scrapy source code, in the file scrapy/contrib/spiders/crawl.py: add

from scrapy.contrib.spiders.init import InitSpider

and change

class CrawlSpider(BaseSpider)

to

class CrawlSpider(InitSpider)

Otherwise the spider wouldn't call the init_request method.
Is there any other easier way?
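If editing the Scrapy source is undesirable, the same inheritance change can also be made locally in your own project instead of patching crawl.py. This is only a sketch, not part of the original post; the class name InitCrawlSpider is a made-up example, and it assumes the Scrapy 0.14-era import paths used elsewhere on this page:

# Hypothetical local base class: combines CrawlSpider's rule handling
# with InitSpider's init_request/initialized handshake, so Scrapy's own
# source does not need to be modified.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders.init import InitSpider

class InitCrawlSpider(CrawlSpider, InitSpider):
    """A CrawlSpider that also runs InitSpider's initialization request."""
    pass

A spider could then inherit from this local class and keep its rules, init_request and login callbacks exactly as in the example further down.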
Do not override the parse function in a CrawlSpider:
When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning about this in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule
This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
Logging in before crawling:
In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from BaseSpider) and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.
In order for the Spider to begin crawling, you need to call self.initialized.
You can read the code that's responsible for this here (it has helpful docstrings).
An example:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass
Saving items:
Items your Spider returns are passed along to the Pipeline which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
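As a minimal sketch (the names MyItem and MyPipeline, and the project path, are illustrative assumptions, not from the original answer), parse_item could return an Item, and a pipeline could then process it; with the Scrapy 0.14-era API used above, the pipeline is enabled through the ITEM_PIPELINES setting:

# items.py -- a minimal Item holding the scraped URL
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

# In the spider, parse_item would build and return the item:
#     def parse_item(self, response):
#         i = MyItem()
#         i['url'] = response.url
#         return i

# pipelines.py -- receives every item the spider returns
class MyPipeline(object):
    def process_item(self, item, spider):
        # do whatever you want with the data (store it, validate it, ...)
        return item

# settings.py (Scrapy 0.14 expected a list here)
ITEM_PIPELINES = ['myproject.pipelines.MyPipeline']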
If you have any problems or questions regarding Items, don't hesitate to open a new question and I'll do my best to help.