Scrapy - Understanding CrawlSpider and LinkExtractor

So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

The description then given is:

This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

I understand that the second rule extracts links matching item.php and then extracts the information using the parse_item method. However, what exactly is the purpose of the first rule? It just says that it "collects" the links. What does that mean, and why is it useful if no data is extracted from those pages?

asked Jun 13 '17 by ocean800


1 Answer

CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.

The idea is that "somehow" you have to navigate into each category, looking for the links that correspond to the product/item information you want to extract. Those product links are the ones specified in the second rule of that example (the ones that have item.php in the URL).

Now, how should the spider keep visiting links until it finds pages containing item.php? That's what the first rule is for. It says to visit every link containing category.php but not subsection.php, which means it won't extract any "item" from those links; it just defines the path the spider follows to find the real items.

That's why that rule doesn't contain a callback: those responses won't be returned to you for processing, because they are followed directly.
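
To make the follow behaviour concrete, here is a minimal sketch (not from the docs; the parse_category method is hypothetical) of what the category rule would look like if you also wanted to process category pages yourself. Because adding a callback switches the default of follow to False, you would then have to set follow=True explicitly to keep crawling deeper:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Same category rule as in the docs, but now with a callback.
        # A callback makes follow default to False, so follow=True is
        # needed to keep discovering links from category pages.
        Rule(
            LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', )),
            callback='parse_category',  # hypothetical method, defined below
            follow=True,
        ),
        # Item rule unchanged: parse_item handles the actual data extraction.
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_category(self, response):
        # Do something with the category page itself, e.g. log it.
        self.logger.info('Category page: %s', response.url)

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)
        yield {'url': response.url}

The docs example simply omits the callback on the category rule, which gives the same following behaviour with less code: those responses are used only for link extraction.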

answered Oct 28 '22 by eLRuLL