All the examples I have found of Scrapy talk about how to crawl a single page, pages with the same URL schema, or all the pages of a website. I need to crawl a series of pages A, B, C where page A contains the link to B, and so on. For example, the website structure is:
A
----> B
---------> C
D
E
I need to crawl all the C pages, but to get the link to C I first need to crawl A and B. Any hints?
See the Scrapy Request structure; to crawl a chain like this you'll have to use the callback parameter, something like the following:
from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    ...

    # the spider starts here
    def parse(self, response):
        ...
        # A, D and E are crawled in parallel; A -> B -> C are crawled serially
        yield Request(url=<A url>,
                      ...
                      callback=self.parseA)
        yield Request(url=<D url>,
                      ...
                      callback=self.parseD)
        yield Request(url=<E url>,
                      ...
                      callback=self.parseE)

    def parseA(self, response):
        ...
        yield Request(url=<B url>,
                      ...
                      callback=self.parseB)

    def parseB(self, response):
        ...
        yield Request(url=<C url>,
                      ...
                      callback=self.parseC)

    def parseC(self, response):
        ...

    def parseD(self, response):
        ...

    def parseE(self, response):
        ...
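If you also need to carry data scraped on A or B down to C (for instance, to attach it to the item you finally build in parseC), a common pattern is to pass it along in the Request's meta dict. A minimal sketch along the lines of the code above; the field names category and subcategory are made up for illustration, and <B url>/<C url> stand for whatever you extract from the response:

    def parseA(self, response):
        category = ...   # something scraped from page A (hypothetical)
        yield Request(url=<B url>,
                      meta={'category': category},
                      callback=self.parseB)

    def parseB(self, response):
        subcategory = ...   # something scraped from page B (hypothetical)
        yield Request(url=<C url>,
                      meta={'category': response.meta['category'],
                            'subcategory': subcategory},
                      callback=self.parseC)

    def parseC(self, response):
        # both values carried down from A and B are available here
        category = response.meta['category']
        subcategory = response.meta['subcategory']
        ...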
Here is an example spider I wrote for a project of mine:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from yoMamaSpider.items import JokeItem
from yoMamaSpider.striputils import stripcats, stripjokes
import re

class Jokes4UsSpider(CrawlSpider):
    name = 'jokes4us'
    allowed_domains = ['jokes4us.com']
    start_urls = ["http://www.jokes4us.com/yomamajokes/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            url = ''.join(link.select('./@href').extract())
            relevant_urls = re.compile(
                'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)')
            if relevant_urls.match(url):
                yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        categories = stripcats(hxs.select('//title/text()').extract())
        joke_area = hxs.select('//p/text()').extract()
        for joke in joke_area:
            joke = stripjokes(joke)
            if len(joke) > 15:
                yield JokeItem(joke=joke, categories=categories)
I think the parse method is what you are after: it looks at every link on the start_urls page, then uses a regex to decide whether it is a relevant_url (i.e. a URL I would like to scrape). If it is relevant, it scrapes that page with yield Request(url, callback=self.parse_page), which calls the parse_page method.
Is this the kind of thing you are after?
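The JokeItem class imported from yoMamaSpider.items isn't shown above; judging from how it is constructed in parse_page, it is presumably just an ordinary Scrapy Item with a joke field and a categories field, roughly like this (a guess at its shape, not the author's actual definition):

from scrapy.item import Item, Field

class JokeItem(Item):
    joke = Field()        # the joke text, after cleanup by stripjokes
    categories = Field()  # the categories derived from the page title

You would then run the spider with something like scrapy crawl jokes4us -o jokes.json to dump the collected items to a JSON file (the output filename here is arbitrary).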