I'm having trouble understanding how to use the rules field within my own spider that inherits from CrawlSpider. My spider is trying to crawl through Yellow Pages listings for pizza in San Francisco.
I've kept my rules simple just to see whether the spider would crawl any of the links in the response, but I don't see that happening. The only result is that it yields the request for the next page and then a request for the subsequent page.
I have two questions:
1. Does the spider process the rules before calling the callback when a response is received, or vice versa?
2. When are the rules applied?
EDIT: I figured it out. I overrode the parse method from CrawlSpider. After looking at the parse method within that class, I realized that's where the rules are checked and the extracted links are crawled. (A corrected sketch follows my code below.)
NOTE: Know what you're overriding
Here's my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from yellowPages.items import YellowpagesItem
from scrapy.http import Request
class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    businesses = []

    # start with one page
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (
        Rule(SgmlLinkExtractor(), callback="parse_items", follow=True),
    )

    base_url = 'http://www.yellowpages.com'
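    # NOTE: overriding parse() below is the bug described in the EDIT above.
    # CrawlSpider.parse() is the method that applies the rules, so replacing
    # it means the rules are never evaluated.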
    def parse(self, response):
        yield Request(response.url, callback=self.parse_business_listings_page)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        return []
    def parse_business_listings_page(self, response):
        print "Visiting %s" % response.url
        self.businesses.append(self.extract_businesses_from_response(response))

        hxs = Selector(response)
        li_tags = hxs.xpath('//*[@id="main-content"]/div[4]/div[5]/ul/li')

        next_exist = False
        # Check to see if there's a "Next". If there is, store the links.
        # If not, return. This requires a linear search through the list
        # of li_tags. Is there a faster way?
        for li in li_tags:
            li_text = li.xpath('.//a/text()').extract()
            li_data_page = li.xpath('.//a/@data-page').extract()

            # Note: sometimes li_text is an empty list, so check that it
            # is non-empty first.
            if li_text and li_text[0] == 'Next':
                next_exist = True
                next_page_num = li_data_page[0]
                url = 'http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza&page=' + next_page_num
                yield Request(url, callback=self.parse_business_listings_page)
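For reference, here is a minimal corrected sketch of the spider, assuming the same old scrapy.contrib API and page structure as above. parse() is left alone so CrawlSpider can apply the rules, parse_start_url() is used as the hook for the start page, and a direct XPath (an untested guess at the markup) replaces the linear search over li_tags:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import Selector
from scrapy.http import Request

class YellowPageSpider(CrawlSpider):
    name = "yellowpages"
    allowed_domains = ['www.yellowpages.com']
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza?g=san%20francisco%2C%20ca&q=pizza']

    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_items', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider calls this for start_urls responses, so parse()
        # does not need to be overridden.
        return self.parse_items(response)

    def parse_items(self, response):
        print "PARSE ITEMS. Visiting %s" % response.url
        # Direct XPath for the "Next" link's data-page attribute --
        # no linear search over li tags needed.
        next_page = Selector(response).xpath(
            '//a[text()="Next"]/@data-page').extract()
        if next_page:
            url = ('http://www.yellowpages.com/san-francisco-ca/pizza'
                   '?g=san%20francisco%2C%20ca&q=pizza&page=' + next_page[0])
            yield Request(url, callback=self.parse_items)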
Using the Rule class in Scrapy has a wide range of benefits: it lets you add extra behaviour to your spider and fine-tune how requests are generated and followed. The Rule class can take many different parameters, each with its own effect.
Besides having the same attributes as the regular Spider, the CrawlSpider has a new attribute: rules. rules is a list of one or more Rule objects, where each Rule defines one type of behaviour for crawling the site. Each Rule also uses a LinkExtractor: an object which defines how links will be extracted from each crawled page.
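For illustration, here is a minimal sketch of a spider with two tuned rules, using the same old scrapy.contrib API as the code above. The allow patterns (r'\?page=\d+' and r'/mip/') are hypothetical guesses at Yellow Pages URL shapes, not verified ones:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class RulesExampleSpider(CrawlSpider):
    name = "rules_example"
    allowed_domains = ['www.yellowpages.com']
    start_urls = ['http://www.yellowpages.com/san-francisco-ca/pizza']

    rules = (
        # Follow pagination links, but don't send them to a callback.
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+',)), follow=True),
        # Send business-detail pages to a named callback; note that
        # follow defaults to False when a callback is given.
        Rule(SgmlLinkExtractor(allow=(r'/mip/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log("Business page: %s" % response.url)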
So, to your two questions:
The rules are applied when a response is received: CrawlSpider runs each rule's link extractor over the response to build the next round of requests, so the rules are processed before those requests are made. And if an extracted URL does not comply with allowed_domains, the request is simply dropped by the offsite filter before it is ever sent.
In your example, the problem is exactly that: you override parse(). Unless you explicitly mean to replace it, you should never override the parse() method in a CrawlSpider. parse() is the CrawlSpider's own logic function: it is where the rules are applied. The equivalent of a plain Spider's parse() in a CrawlSpider is parse_item() (or whatever callback your Rule names), and parse() itself should never be used as a callback in the ruleset.
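To see why, here is a simplified sketch of what CrawlSpider.parse() looks like in the Scrapy source (the exact code varies between versions, but the shape is the same); overriding it cuts the rules machinery out entirely:

# Simplified from Scrapy's CrawlSpider -- not a drop-in excerpt of any
# specific version.
def parse(self, response):
    # Every response is routed through _parse_response(), which applies
    # the rules (follow=True) and invokes parse_start_url() as the
    # overridable hook for start_urls responses.
    return self._parse_response(response, self.parse_start_url,
                                cb_kwargs={}, follow=True)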
https://doc.scrapy.org/en/latest/topics/spiders.html