So I'm trying to use CrawlSpider and understand the following example in the Scrapy Docs:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
The description then given is:
This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.
I understand that the second rule extracts links matching item.php and then parses them with the parse_item method. However, what exactly is the purpose of the first rule? It just says that it "collects" the links. What does that mean, and why is it useful if no data is extracted from those pages?
CrawlSpider is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
Scrapy's selectors use the lxml library under the hood and implement a convenient API on top of it, which means they are very similar to lxml in speed and parsing accuracy.
Scrapy also provides item pipelines, which let you write components that process your scraped data, such as validating it, dropping unwanted items, or saving them to a database. It provides spider contracts to test your spiders and allows you to build both generic and deep crawlers.
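For illustration, here is a minimal item pipeline sketch; the class name and the validation rule are made up for this example, and the pipeline would still need to be enabled via the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    # Hypothetical pipeline: drop items that have no name, pass everything else on.
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('Missing name in %s' % item)
        return item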
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).
CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.
The idea is that "somehow" you have to go into each category, searching for links that correspond to the product/item information you want to extract. Those product links are the ones specified in the second rule of that example (the ones that have item.php in the URL).

Now, how should the spider keep visiting links until it finds the ones containing item.php? That's what the first rule is for. It says to visit every link containing category.php but not subsection.php, which means it won't extract any "item" from those links, but it defines the path the spider follows to find the real items.

That's why that rule doesn't contain a callback: the responses for those links are not handed to you for processing, they are simply followed.
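To make that behaviour explicit, the same rules could be written with the follow flag spelled out; in Scrapy, follow defaults to True when a Rule has no callback and to False otherwise, so this sketch changes nothing functionally:

rules = (
    # Category pages: no callback, so they are only used to discover more links.
    # follow defaults to True here; writing it out is purely for clarity.
    Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', )),
         follow=True),
    # Item pages: parsed by parse_item. With a callback, follow defaults to False,
    # so links found on item pages are not followed any further.
    Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item',
         follow=False),
)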