
Crawling a site recursively using scrapy

I am trying to scrape a site using Scrapy.

This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all, so I tried to rebuild it):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem
from scrapy.http import Request


class MySpider(Spider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    rules = [Rule(LinkExtractor(allow=('')), callback='parse', follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        titles = hxs.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item

The problem is that the crawler goes to the start page but does not scrape any pages after that.

asked Dec 28 '15 by Macro


2 Answers

The following can be a good way to start.

There are two use cases for 'crawling a site recursively using Scrapy':

A) We just want to move across the website, using, say, the pagination buttons of a table, and fetch data. This is relatively straightforward.

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Observe the last four lines. Here:

  1. We get the next-page link from the 'Next' pagination button's XPath.
  2. The if condition checks that we have not reached the end of the pagination.
  3. We join this link (obtained in step 1) with the main URL using urljoin.
  4. We make a recursive call to the 'parse' callback method.
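
As a side note, recent Scrapy versions (1.4 and later) provide response.follow, which accepts relative URLs directly, so the urljoin step can be folded away. A minimal sketch of the same loop, assuming one of those versions:

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']  # placeholder URL from the example above

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            # response.follow resolves relative links itself, so no urljoin is needed
            yield response.follow(next_page, callback=self.parse)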

B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Here, observe that:

  1. We are using 'CrawlSpider', a subclass of the 'scrapy.Spider' parent class.

  2. We have set two 'Rules':

    a) The first rule just checks whether there is a 'next_page' link available and follows it.

    b) The second rule extracts all the links on a page that match a format, say '/trains/12343', and then calls 'parse_trains' to perform the parsing.

  3. Important: note that we don't want to use the regular 'parse' method here, because we are using the 'CrawlSpider' subclass. This class uses a 'parse' method internally, so we must not override it. Just remember to name your callback method something other than 'parse'.
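
For completeness, here is what the callback in case B might look like once fleshed out. The XPath and item fields below are purely illustrative, since the answer does not show the real page structure:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']  # placeholder, as in the example above
    rules = (
        # keep following the pagination link
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # send each /trains/<id> detail page to parse_trains
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        # hypothetical fields and XPath -- adjust to the real page markup
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }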

answered Oct 21 '22 by Santosh Pillai


The problem is which Spider class you are using as a base: scrapy.Spider is a simple spider that does not support rules and link extractors.

Instead, use CrawlSpider:

class MySpider(CrawlSpider):
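
Applied to the spider from the question, a minimal sketch could look like the following. Note that the callback is renamed so it does not override CrawlSpider's own parse method (the same caveat the first answer points out); the title XPath is the one from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem


class MySpider(CrawlSpider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    # follow every link on the site and hand each response to parse_item
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # same title XPath as in the question
        for title in response.xpath('//li[@class="posts__post"]/a/text()').extract():
            item = NettutsItem()
            item["title"] = title
            yield item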
answered Oct 21 '22 by alecxe