Get scrapy spider to crawl entire site

I am using Scrapy to crawl some old sites that I own, with the code below as my spider. I don't mind having a file output for each page, or a database with all the content in it. But I need the spider to be able to crawl the whole site without me having to put in every single URL, which is what I am currently doing.

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/contactus"
    ]

    def parse(self, response):
        # name the output file after a segment of the URL and save the raw body
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
asked Apr 25 '16 by Lewis Smith

People also ask

How do you crawl in Scrapy?

All we have to do is tell the scraper to follow that link if it exists. First, we define a selector for the "next page" link, extract the first match, and check if it exists. The scrapy.Request we return says "hey, crawl this page", and the callback argument tells Scrapy which method should parse the response.
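
A minimal sketch of that pattern, assuming a hypothetical site whose "next page" link sits in an li.next element (the selector and URL are illustrative, not from the question):

import scrapy

class PagedSpider(scrapy.Spider):
    name = "paged"
    # illustrative start URL; swap in the real site
    start_urls = ["http://www.example.com/page/1"]

    def parse(self, response):
        # ... extract the items you want from the current page here ...

        # select the "next page" link and take the first match
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # returning a Request says "crawl this page too";
            # callback=self.parse reuses this method on the next response
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)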

Is Scrapy better than Beautifulsoup?

Scrapy is a more robust, feature-complete, more extensible, and better-maintained web scraping tool. Scrapy allows you to crawl, extract, and store a full website. BeautifulSoup, on the other hand, only allows you to parse HTML and extract the information you're looking for.
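
To illustrate the difference, here is a minimal BeautifulSoup sketch: you fetch each page yourself (with requests here), and there is no crawling, scheduling, or storage built in. The URL is a placeholder.

import requests
from bs4 import BeautifulSoup

# BeautifulSoup only parses HTML you have already fetched yourself
html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, "html.parser")

# extract every link from the single page we fetched
for link in soup.find_all("a"):
    print(link.get("href"))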


1 Answer

To crawl the whole site you should use a CrawlSpider instead of the plain scrapy.Spider.

Here's an example

For your purposes, try using something like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # an empty LinkExtractor matches every link on a page;
        # follow=True keeps extracting links from the pages it visits,
        # so the crawl spreads across the whole domain
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    # note: with CrawlSpider, use a callback other than parse,
    # since CrawlSpider uses parse internally
    def parse_item(self, response):
        # same saving logic as the question's spider
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
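
Assuming the code above is saved as myspider.py (a hypothetical filename), you can run it standalone with Scrapy's runspider command, or with scrapy crawl example.com inside a Scrapy project:

scrapy runspider myspider.py

One caveat on the saving logic: response.url.split("/")[-2] mirrors the question's filename scheme, so URLs that share the same second-to-last segment will overwrite each other's files.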

Also, take a look at this article

answered Sep 18 '22 by Daniil Mashkin