I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.
If I have a rule where I'm crawling through the yellowpages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule - specifically how does it activiate the restrict_xpath attribute?
Thanks.
Scrapy follows asynchronous processing i.e. the requesting process does not wait for the response, instead continues with further tasks. Once a response arrives, the requesting process proceeds to manipulate the response. The spiders in Scrapy work in the same way.
Scrapy is a web scraping library that is used to scrape, parse and collect web data. For all these functions we are having a pipelines.py file which is used to handle scraped data through various components (known as class) which are executed sequentially.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
Due to the built-in support for generating feed exports in multiple formats, as well as selecting and extracting data from various sources, the performance of Scrapy can be said to be faster than Beautiful Soup. Working with Beautiful Soup can speed up with the help of Multithreading process.
The rules attribute for a CrawlSpider
specify how to extract the links from a page and which callbacks should be called for those links. They are handled by the default parse()
method implemented in that class -- look here to read the source.
So, whenever you want to trigger the rules for an URL, you just need to yield a scrapy.Request(url, self.parse)
, and the Scrapy engine will send a request to that URL and apply the rules to the response.
The extraction of the links (that may or may not use restrict_xpaths
) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a>
s and <area>
s elements in the whole page or only in the elements obtained after applying the restrict_xpaths
expressions if the attribute is set.
For example, say you have a CrawlSpider like so:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor class MySpider(CrawlSpider): start_urls = ['http://someurlhere.com'] rules = ( Rule( LinkExtractor(restrict_xpaths=[ "//ul[@class='menu-categories']", "//ul[@class='menu-subcategories']"]), callback='parse' ), Rule( LinkExtractor(allow='/product.php?id=\d+'), callback='parse_product_page' ), ) def parse_product_page(self, response): # yield product item here
The engine starts sending requests to the urls in start_urls
and executing the default callback (the parse()
method in CrawlSpider) for their response.
For each response, the parse() method will execute the link extractors on it to get the links from the page. Namely, it calls the LinkExtractor.extract_links(response)
for each response object to get the urls, and then yields scrapy.Request(url, <rule_callback>)
objects.
The example code is an skeleton for a spider that crawls an e-commerce site following the links of product categories and subcategories, to get links for each of the product pages.
For the rules registered specifically in this spider, it would crawl the links inside the lists of "categories" and "subcategories" with the parse()
method as callback (which will trigger the crawl rules to be called for these pages), and the links matching the regular expression product.php?id=\d+
with the callback parse_product_page()
-- which would finally scrape the product data.
As you can see, pretty powerful stuff. =)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With