I'm having trouble understanding how to use the rules field within my own spider that inherits from CrawlSpider. My spider is trying to crawl through Yellow Pages listings for pizza in San Francisco.
I've kept my rules simple just to see whether the spider would crawl any of the links in the response, but I don't see that happening. The only result is that it yields the request for the next page and then a request for the subsequent page.
I have two questions:
1. Does the spider process the rules before calling the callback when a response is received, or vice versa?
2. When are the rules applied?
EDIT: I figured it out. I overrode the parse method from CrawlSpider. After looking at the parse method within that class, I realized that's where the rules are checked and the extracted links are crawled. (A corrected sketch follows my code below.)
NOTE: Know what you're overriding
Here's my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request
class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []

    # start with one page
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (
        Rule(SgmlLinkExtractor(), callback="parse_items", follow=True),
    )

    base_url = 'http://www.yellowpages.com'
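    # NOTE: overriding parse() below is the bug described in the EDIT above.
    # CrawlSpider.parse() is the method that applies the rules, so replacing
    # it means the rules are never evaluated.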
    def parse(self, response):
        yield Request(response.url, callback=self.parse_business_listings_page)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        return []
    def parse_business_listings_page(self, response):
        print "Visiting %s" % response.url
        self.businesses.append(self.extract_businesses_from_response(response))

        hxs = Selector(response)
        li_tags = hxs.xpath('//*[@id="main-content"]/div[4]/div[5]/ul/li')

        next_exist = False
        # Check to see if there's a "Next". If there is, store the links.
        # If not, return. This requires a linear search through the list
        # of li_tags. Is there a faster way?
        for li in li_tags:
            li_text = li.xpath('.//a/text()').extract()
            li_data_page = li.xpath('.//a/@data-page').extract()

            # Note: sometimes li_text is an empty list, so check that it
            # is non-empty first.
            if li_text and li_text[0] == 'Next':
                next_exist = True
                next_page_num = li_data_page[0]
                url = 'http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza&page=' + next_page_num
                yield Request(url, callback=self.parse_business_listings_page)
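For reference, here is a minimal corrected sketch of the spider, assuming the same old scrapy.contrib API and page structure as above. parse() is left alone so CrawlSpider can apply the rules, parse_start_url() is used as the hook for the start page, and a direct XPath (an untested guess at the markup) replaces the linear search over li_tags:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from scrapy.http import Request

class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_items', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider calls this for start_urls responses, so parse()
        # does not need to be overridden.
        return self.parse_items(response)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        # Direct XPath for the "Next" link's data-page attribute --
        # no linear search over li tags needed.
        next_page = Selector(response).xpath(
            '//a[text()="Next"]/@data-page').extract()
        if next_page:
            url = ('http://www.yellowpages.com/san-francisco-ca/pizza'
                   '?g=san%20francisco%2C%20ca&q=pizza&page=' + next_page[0])
            yield Request(url, callback=self.parse_items)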
Using the Rule class in Scrapy has a wide range of benefits: it lets you add extra behaviour to your spider and fine-tune how requests are generated and followed. The Rule class can take many different parameters, each with its own effect.
Besides having the same attributes as the regular Spider, the CrawlSpider has a new attribute: rules. rules is a list of one or more Rule objects, where each Rule defines one type of behaviour for crawling the site. Each Rule also uses a LinkExtractor: an object which defines how links will be extracted from each crawled page.
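For illustration, here is a minimal sketch of a spider with two tuned rules, using the same old scrapy.contrib API as the code above. The allow patterns (r'\?page=\d+' and r'/mip/') are hypothetical guesses at Yellow Pages URL shapes, not verified ones:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class RulesExampleSpider(CrawlSpider):
    name = "rules_example"
    allowed_domains = ['www.yellowpages.com']
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza']

    rules = (
        # Follow pagination links, but don't send them to a callback.
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+',)), follow=True),
        # Send business-detail pages to a named callback; note that
        # follow defaults to False when a callback is given.
        Rule(SgmlLinkExtractor(allow=(r'/mip/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log("Business page: %s" % response.url)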
So, to your two questions:
The rules are applied when a response is received: CrawlSpider runs each rule's link extractor over the response to build the next round of requests, so the rules are processed before those requests are made. And if an extracted URL does not comply with allowed_domains, the request is simply dropped by the offsite filter before it is ever sent.
In your example, the problem is exactly that: you override parse(). Unless you explicitly mean to replace it, you should never override the parse() method in a CrawlSpider. parse() is the CrawlSpider's own logic function: it is where the rules are applied. The equivalent of a plain Spider's parse() in a CrawlSpider is parse_item() (or whatever callback your Rule names), and parse() itself should never be used as a callback in the ruleset.
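To see why, here is a simplified sketch of what CrawlSpider.parse() looks like in the Scrapy source (the exact code varies between versions, but the shape is the same); overriding it cuts the rules machinery out entirely:

# Simplified from Scrapy's CrawlSpider -- not a drop-in excerpt of any
# specific version.
def parse(self, response):
    # Every response is routed through _parse_response(), which applies
    # the rules (follow=True) and invokes parse_start_url() as the
    # overridable hook for start_urls responses.
    return self._parse_response(response, self.parse_start_url,
                                cb_kwargs={}, follow=True)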
https://doc.scrapy.org/en/latest/topics/spiders.html