
How to crawl a site given only the domain URL with Scrapy

I am trying to use Scrapy to crawl a website, but it has no sitemap or page index. How can I crawl all pages of the website with Scrapy?

I just need to download all pages of the site without extracting any items. Do I just need to set a Rule in the spider to follow all links? And I don't know whether Scrapy will avoid duplicate URLs this way.

asked Jan 05 '13 by David Thompson

2 Answers

I just found the answer myself. With the CrawlSpider class, we just need to set the allow=() argument of SgmlLinkExtractor (replaced by LinkExtractor in modern Scrapy). As the documentation says:

allow (a regular expression, or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.

answered Sep 20 '22 by David Thompson

In your Spider, define allowed_domains as a list of domains you want to crawl.

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.

Alternatively, you can filter the domains with a LinkExtractor (like David Thompson mentioned).

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
answered Sep 20 '22 by jpyams