I am trying to use Scrapy to crawl a website, but there is no sitemap or page index for the site. How can I crawl all of its pages with Scrapy?
I just need to download all the pages of the site without extracting any items. Do I only need to set a Rule that follows all links in the spider? But I don't know whether Scrapy will avoid duplicate URLs this way.
I just found the answer myself. With the CrawlSpider class, you just need to set allow=() in the SgmlLinkExtractor (replaced by LinkExtractor in current Scrapy versions). As the documentation says:
allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.
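A minimal sketch of such a spider, assuming the current LinkExtractor API (SgmlLinkExtractor has since been removed from Scrapy) and a placeholder domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AllPagesSpider(CrawlSpider):
    name = 'allpages'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['http://example.com/']

    # allow=() (the default) matches every link; follow=True keeps
    # the CrawlSpider crawling from each downloaded page.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Just record the downloaded page; no items are extracted.
        self.logger.info('Downloaded %s (%d bytes)', response.url, len(response.body))

As for duplicate URLs: Scrapy's scheduler filters requests with duplicate fingerprints by default, so the same URL is not downloaded twice.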
In your Spider, define allowed_domains as a list of the domains you want to crawl.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
Then you can use response.follow() to follow the links. See the Spiders documentation and the tutorial for details.
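For example, a sketch of a parse method that follows every link on the page and relies on allowed_domains (enforced by Scrapy's offsite filtering) to keep the crawl on the target site; the start URL and selector here are assumptions:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow every <a href> on the page; requests to other domains
        # are dropped because of allowed_domains.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)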
Alternatively, you can filter the domains with a LinkExtractor (as David Thompson mentioned).
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Extract the quote items on the current page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow every link that stays within the allowed domain.
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
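Assuming the spider lives inside a Scrapy project, it can be run with scrapy crawl quotes; adding -o quotes.json exports the scraped items to a file.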