
How to use scrapy to crawl multiple pages?

Tags: python, scrapy

All the examples of Scrapy I found talk about how to crawl a single page, pages with the same URL schema, or all the pages of a website. I need to crawl a series of pages A, B, C, where A contains the link to B and so on. For example, the website structure is:

A
----> B
---------> C
D
E

I need to crawl all the C pages, but to get the links to C I first need to crawl A and B. Any hints?

asked Dec 15 '13 by tapioco123


People also ask

How do you go to the next page in Scrapy?

Follow the next-page link from your parse callback, then run the spider with scrapy crawl spider -o next_page.json and check the result.
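As a rough illustration, here is the pagination pattern along the lines of the official Scrapy tutorial, run against the public quotes.toscrape.com demo site; adapt the selectors to your own pages:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # scrape the items on the current page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # keep following the "next page" link until there is none
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)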

How do you run multiple spiders in Scrapy?

Use the CrawlerProcess class to run multiple Scrapy spiders in the same process simultaneously. Create an instance of CrawlerProcess with the project settings; create a separate Crawler instance for a spider if you want that spider to have custom settings.
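A minimal sketch of that pattern (the spider classes and module path are hypothetical placeholders for spiders defined in your own project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical spiders assumed to live in your project
from myproject.spiders import SpiderOne, SpiderTwo

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # the script blocks here until both spiders finish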


2 Answers

See the Scrapy Request structure; to crawl such a chain you'll have to use the callback parameter, like in the following:

class MySpider(BaseSpider):
    ...
    # spider starts here
    def parse(self, response):
        ...
        # A, D, E are done in parallel, A -> B -> C are done serially
        yield Request(url=<A url>,
                      ...
                      callback=self.parseA)
        yield Request(url=<D url>,
                      ...
                      callback=self.parseD)
        yield Request(url=<E url>,
                      ...
                      callback=self.parseE)

    def parseA(self, response):
        ...
        yield Request(url=<B url>,
                      ...
                      callback=self.parseB)

    def parseB(self, response):
        ...
        yield Request(url=<C url>,
                      ...
                      callback=self.parseC)

    def parseC(self, response):
        ...

    def parseD(self, response):
        ...

    def parseE(self, response):
        ...
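For reference, here is a minimal, self-contained sketch of the same chained-callback idea on current Scrapy (1.7+). The start URL and CSS selectors are hypothetical placeholders, and data picked up on A is passed along to C via cb_kwargs:

import scrapy

class ChainSpider(scrapy.Spider):
    name = "chain"
    start_urls = ["http://example.com/"]  # hypothetical entry page listing A, D, E

    def parse(self, response):
        # follow the link to A (D and E would be handled the same way)
        a_url = response.css("a.to-a::attr(href)").get()  # hypothetical selector
        if a_url:
            yield response.follow(a_url, callback=self.parse_a)

    def parse_a(self, response):
        # on A, find the link to B and carry some data forward
        b_url = response.css("a.to-b::attr(href)").get()  # hypothetical selector
        if b_url:
            yield response.follow(
                b_url,
                callback=self.parse_b,
                cb_kwargs={"a_title": response.css("title::text").get()},
            )

    def parse_b(self, response, a_title):
        # on B, find the link to C and keep passing the data along
        c_url = response.css("a.to-c::attr(href)").get()  # hypothetical selector
        if c_url:
            yield response.follow(
                c_url,
                callback=self.parse_c,
                cb_kwargs={"a_title": a_title},
            )

    def parse_c(self, response, a_title):
        # C is the page you actually want to scrape
        yield {"a_title": a_title, "c_url": response.url}

The key point is the same as in the skeleton above: each callback yields the Request for the next page in the chain.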
answered Sep 23 '22 by Guy Gavriely


Here is an example spider I wrote for a project of mine:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from yoMamaSpider.items import JokeItem
from yoMamaSpider.striputils import stripcats, stripjokes
import re

class Jokes4UsSpider(CrawlSpider):
    name = 'jokes4us'
    allowed_domains = ['jokes4us.com']
    start_urls = ["http://www.jokes4us.com/yomamajokes/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            url = ''.join(link.select('./@href').extract())
            relevant_urls = re.compile(
                'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)')
            if relevant_urls.match(url):
                yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        categories = stripcats(hxs.select('//title/text()').extract())
        joke_area = hxs.select('//p/text()').extract()
        for joke in joke_area:
            joke = stripjokes(joke)
            if len(joke) > 15:
                yield JokeItem(joke=joke, categories=categories)

I think the parse method is what you are after: it looks at every link on the start_urls page, then uses a regex to decide whether it is a relevant_url (i.e. a URL I would like to scrape). If it is relevant, it scrapes that page using yield Request(url, callback=self.parse_page), which calls the parse_page method.

Is this the kind of thing you are after?
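One caveat: this answer uses the old scrapy.contrib / HtmlXPathSelector API from 2013, which has since been removed. On current Scrapy the same idea can be written against response.xpath directly; a rough sketch, keeping the answer's stripcats/stripjokes helpers and JokeItem as-is:

import re
import scrapy
from yoMamaSpider.items import JokeItem
from yoMamaSpider.striputils import stripcats, stripjokes

class Jokes4UsSpider(scrapy.Spider):
    name = 'jokes4us'
    allowed_domains = ['jokes4us.com']
    start_urls = ["http://www.jokes4us.com/yomamajokes/"]

    def parse(self, response):
        # follow only the category pages (same regex idea as the original answer)
        for href in response.xpath('//a/@href').getall():
            if re.match(r'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)', href):
                yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        # response.xpath replaces HtmlXPathSelector in current Scrapy
        categories = stripcats(response.xpath('//title/text()').getall())
        for joke in response.xpath('//p/text()').getall():
            joke = stripjokes(joke)
            if len(joke) > 15:
                yield JokeItem(joke=joke, categories=categories)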

answered Sep 24 '22 by Karim Tabet