
Crawling with an authenticated session in Scrapy

Tags:

python

scrapy

In my previous question, I wasn't very specific about my problem (scraping with an authenticated session in Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably rather have used the word crawling.

So, here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if "Hi Herman" not in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i = MyItem()  # an Item subclass defined in the project's items.py
        i['url'] = response.url
        # ... do more things
        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I overrode in order to log in no longer makes the necessary calls to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Has anyone done something like this before (authenticating, then crawling, using a CrawlSpider)? Any help would be appreciated.

asked May 01 '11 by Herman Schaaf

People also ask

What does Scrapy crawl do?

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.

Which command is used to crawl the data from website using Scrapy library?

You can run a crawler on a web page using the fetch command in the Scrapy shell. A crawler or spider goes through a webpage, downloading its text and metadata.
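For illustration, a typical session might look like this (myspider is a placeholder spider name, and example.com a placeholder site):

$ scrapy shell                          # open the interactive Scrapy shell
>>> fetch('http://www.example.com/')    # download a page into the shell
>>> response.body                       # inspect the downloaded HTML

$ scrapy crawl myspider                 # run a full spider by name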

Is Scrapy good for web scraping?

Scrapy, being one of the most popular web scraping frameworks, is a great choice if you want to learn how to scrape data from the web. In this tutorial, you'll learn how to get started with Scrapy and you'll also implement an example project to scrape an e-commerce website.

What is Start_urls in Scrapy?

start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use a CrawlSpider and define rules for it, as sketched below.
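A minimal sketch of such a recursive spider, using the same 0.14-era API as the answers below (example.com and the /articles/ pattern are placeholders):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class RecursiveSpider(CrawlSpider):
    name = 'recursive'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']  # crawling starts from these links

    # Follow every link matching the pattern, recursively,
    # and hand each matching page to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/articles/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('Visited %s' % response.url)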


2 Answers

In order for the solution in the other answer (below) to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider, by making the following change in the Scrapy source code, in the file scrapy/contrib/spiders/crawl.py:

  1. add: from scrapy.contrib.spiders.init import InitSpider
  2. change class CrawlSpider(BaseSpider) to class CrawlSpider(InitSpider)

Otherwise the spider wouldn't call the init_request method.
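A sketch of those two edits (against the 0.14-era source layout; this patches your local Scrapy installation, so treat it as a workaround rather than an official fix):

# scrapy/contrib/spiders/crawl.py
from scrapy.contrib.spiders.init import InitSpider  # 1. newly added import

class CrawlSpider(InitSpider):  # 2. was: class CrawlSpider(BaseSpider)
    ...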

Is there any other easier way?

answered Oct 17 '22 by viniciusnz


Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
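To see why, this is roughly what CrawlSpider.parse does internally (a simplified sketch of the 0.14-era source, not a verbatim copy):

class CrawlSpider(BaseSpider):

    def parse(self, response):
        # parse is reserved here: it feeds every response through the
        # crawling rules, extracting links to follow and dispatching
        # matching responses to the callbacks named in self.rules.
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

If you override parse, this dispatching never runs and your Rules are silently ignored.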


Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from BaseSpider), and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.

In order for the Spider to begin crawling, you need to call self.initialized() (and return its result from your final initialization callback).

You can read the code that's responsible for this in scrapy/contrib/spiders/init.py (it has helpful docstrings).
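For reference, a simplified sketch of what InitSpider does under the hood (paraphrased from scrapy/contrib/spiders/init.py, not a verbatim copy):

class InitSpider(BaseSpider):

    def start_requests(self):
        # Hold back the normal requests for start_urls...
        self._postinit_reqs = super(InitSpider, self).start_requests()
        # ...and issue the initialization request(s) first.
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        # Calling this releases the held-back start_urls requests,
        # so normal crawling can begin.
        return self._postinit_reqs

    def init_request(self):
        # Default: no initialization needed. Override this to
        # perform login requests etc. before crawling starts.
        return self.initialized()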


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass

Saving items:

Items your Spider returns are passed along to the Item Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
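As a rough illustration, a minimal pipeline might look like this (the process_item signature and DropItem are standard Scrapy; the duplicate-URL check itself is just an invented example):

from scrapy.exceptions import DropItem

class DedupePipeline(object):
    """Drop any item whose URL has already been seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Called for every item the spider returns
        if item['url'] in self.seen_urls:
            raise DropItem('Duplicate item: %s' % item['url'])
        self.seen_urls.add(item['url'])
        return item

Enable it by adding the pipeline's class path to the ITEM_PIPELINES setting in your project's settings.py.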

If you have any problems or questions regarding Items, don't hesitate to open a new question and I'll do my best to help.

answered Oct 17 '22 by Acorn