I have the following code for a web crawler in Python 3:
import requests
from bs4 import BeautifulSoup
import re
def get_links(link):
    return_links = []
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "lxml")
    if r.status_code != 200:
        print("Error. Something is wrong here")
    else:
        # Collect every anchor whose href starts with http
        for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
            return_links.append(link.get('href'))
    return return_links

def recursive_search(links):
    for i in links:
        links.extend(get_links(i))
    recursive_search(links)

recursive_search(get_links("https://www.brandonskerritt.github.io"))
The code basically gets all the links off of my GitHub pages website, and then it gets all the links off of those links, and so on until the end of time or an error occurs.
I want to recreate this code in Scrapy so it can obey robots.txt and be a better web crawler overall. I've researched online and I can only find tutorials, guides, Stack Overflow and Quora answers, and blog posts about how to scrape a specific domain (allowed_domains=["google.com"], for example). I do not want to do this. I want to create code that will scrape all websites recursively.
This isn't much of a problem, but all the blog posts etc. only show how to get the links from a specific website (for example, the links might be in list tags). The code I have above works for all anchor tags, regardless of what website it's being run on.
I do not want to use this in the wild, I need it for demonstration purposes so I'm not going to suddenly annoy everyone with excessive web crawling.
Any help will be appreciated!
A link extractor is an object that extracts links from responses. The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching Link objects from a Response object.
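As a rough sketch of how that could look inside a spider (the spider name and start URL are placeholders, and the allow pattern simply mirrors the "^http" regex from the question):

import scrapy
from scrapy.linkextractors import LinkExtractor  # alias for LxmlLinkExtractor

class LinkDemoSpider(scrapy.Spider):
    name = 'link_demo'                    # hypothetical name, for illustration
    start_urls = ['https://example.com']  # placeholder start URL

    # Keep only hrefs matching the question's re.compile("^http") filter
    link_extractor = LinkExtractor(allow=r'^http')

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # Each item is a Link object exposing .url and .text
            yield response.follow(link.url, callback=self.parse)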
Scrapy, being one of the most popular web scraping frameworks, is a great choice if you want to learn how to scrape data from the web.
There is an entire section of the Scrapy guide dedicated to broad crawls. I suggest you fine-tune your settings to do this successfully.
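Those broad-crawl recommendations mostly boil down to project settings. A minimal settings.py sketch along those lines (the exact numbers are only starting points, and ROBOTSTXT_OBEY is included because the question wants robots.txt respected):

# settings.py -- a possible starting point for a polite broad crawl
ROBOTSTXT_OBEY = True            # download and respect robots.txt
CONCURRENT_REQUESTS = 100        # broad crawls benefit from higher global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20  # larger Twisted thread pool, mainly for DNS lookups
LOG_LEVEL = 'INFO'               # keep logging overhead down
COOKIES_ENABLED = False          # session state is rarely needed for broad crawls
RETRY_ENABLED = False            # do not retry failed pages
DOWNLOAD_TIMEOUT = 15            # give up on slow sites quickly
DOWNLOAD_DELAY = 1               # be gentle with each individual host

# Crawl breadth-first instead of depth-first to spread requests across sites
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'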
To recreate the behaviour you need in Scrapy, you must set your start URL in start_urls and implement a parse method that records each crawled URL and follows every link it finds.
An untested example (which can, of course, be refined):
import scrapy

class AllSpider(scrapy.Spider):
    name = 'all'
    start_urls = ['https://yourgithub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []

    def parse(self, response):
        # Remember every URL that was actually crawled
        self.links.append(response.url)
        # Follow every anchor on the page, whatever domain it points to
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)
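If you want to try it quickly outside a full Scrapy project, one option is to drive it from a plain script with CrawlerProcess (a sketch only; the depth limit is an assumption added to keep a demo crawl finite), or save the file and run it with scrapy runspider:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'ROBOTSTXT_OBEY': True,  # honour robots.txt, per the question's goal
    'DEPTH_LIMIT': 2,        # assumed cap so the demo does not run forever
})
process.crawl(AllSpider)
process.start()  # blocks until the crawl finishes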