
Crawling a site recursively using scrapy

I am trying to scrape a site using Scrapy.

This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all, so I tried to rebuild it):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem
from scrapy.http import Request


class MySpider(Spider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    rules = [Rule(LinkExtractor(allow=('')), callback='parse', follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        titles = hxs.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item

The problem is that the crawler goes to the start page but does not scrape any pages after that.

asked Dec 28 '15 by Macro


2 Answers

The following can be a good way to start.

There are two use cases for 'crawling a site recursively using Scrapy':

A) We just want to move across the website, using, say, the pagination buttons of a table, and fetch data. This is relatively straightforward.

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Observe the last four lines. Here:

  1. We get the next-page link from the 'Next' pagination button's XPath.
  2. The if condition checks that we have not reached the end of the pagination.
  3. We join this link (obtained in step 1) with the main URL using urljoin.
  4. We make a recursive call to the 'parse' callback method.
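
As a side note, recent Scrapy versions (1.4 and later) provide response.follow, which accepts relative URLs directly, so the urljoin step can be folded away. A minimal sketch of the same loop, assuming one of those versions:

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']  # placeholder URL from the example above

    def parse(self, response):
        ''' do something with this parser '''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            # response.follow resolves relative links itself, so no urljoin is needed
            yield response.follow(next_page, callback=self.parse)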

B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains')
    )

    def parse_trains(self, response):
        '''do your parsing here'''

Here, observe that:

  1. We are using 'CrawlSpider', a subclass of the 'scrapy.Spider' parent class.

  2. We have set two 'Rules':

    a) The first rule just checks whether there is a 'next_page' link available and follows it.

    b) The second rule extracts all the links on a page that match a format, say '/trains/12343', and then calls 'parse_trains' to perform the parsing.

  3. Important: note that we don't want to use the regular 'parse' method here, because we are using the 'CrawlSpider' subclass. This class uses a 'parse' method internally, so we must not override it. Just remember to name your callback method something other than 'parse'.
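
For completeness, here is what the callback in case B might look like once fleshed out. The XPath and item fields below are purely illustrative, since the answer does not show the real page structure:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']  # placeholder, as in the example above
    rules = (
        # keep following the pagination link
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # send each /trains/<id> detail page to parse_trains
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        # hypothetical fields and XPath -- adjust to the real page markup
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').extract_first(),
        }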

answered Oct 21 '22 by Santosh Pillai


The problem is which Spider class you are using as a base: scrapy.Spider is a simple spider that does not support rules and link extractors.

Instead, use CrawlSpider:

class MySpider(CrawlSpider):
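
Applied to the spider from the question, a minimal sketch could look like the following. Note that the callback is renamed so it does not override CrawlSpider's own parse method (the same caveat the first answer points out); the title XPath is the one from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem


class MySpider(CrawlSpider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    # follow every link on the site and hand each response to parse_item
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # same title XPath as in the question
        for title in response.xpath('//li[@class="posts__post"]/a/text()').extract():
            item = NettutsItem()
            item["title"] = title
            yield item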
answered Oct 21 '22 by alecxe