
Scrapy: store all external links and crawl all internal links

I have been working on a Scrapy web scraper that crawls through all internal links from a start URL and collects only the external links. My main problem is telling external links apart from internal links. For example, when I try to filter out external links with link.startswith("http") or link.startswith("ftp") or link.startswith("www"), if the website links to its own pages with an absolute path (www.my-domain.com/about instead of /about), the link is classified as external even though it isn't. The following is my code:

import scrapy
from lab_relationship.items import Links

class WebSpider(scrapy.Spider):
    name = "web"
    allowed_domains = ["my-domain.com"]
    start_urls = (
        'http://www.my-domain.com/',
    )

    def parse(self, response):
        """ finds all external links"""
        items = []
        for link in set(response.xpath('//a/@href').extract()):
            item = Links()
            if len(link) > 1:
                if link.startswith("/") or link.startswith("."):
                    # internal link
                    url = response.urljoin(link)
                    item['internal'] = url
                    #yield scrapy.Request(url, self.parse)
                elif link.startswith("http") or link.startswith("ftp") or link.startswith("www"):
                    # external link
                    item['external'] = link
                else:
                    # misc. links: mailto, id (#)
                    item['misc'] = link
                items.append(item)
        return items

Any suggestions?

asked Oct 03 '15 by THIS USER NEEDS HELP

1 Answer

Use the link extractor.

When instantiating it, you have to pass the allowed domain. You don't have to worry about specifying which tags to extract from, since (according to the docs) the tags parameter defaults to ('a', 'area').

Using the Rust language website as an example, the code to print all the internal links from their domain would look like this:

import scrapy
from scrapy.linkextractors import LinkExtractor


class RustSpider(scrapy.Spider):
    name = "rust"
    allowed_domains = ["www.rust-lang.org"]
    start_urls = (
        'http://www.rust-lang.org/',
    )

    def parse(self, response):
        # Keep only links whose domain matches rust-lang.org (subdomains included)
        extractor = LinkExtractor(allow_domains='rust-lang.org')
        links = extractor.extract_links(response)
        for link in links:
            print(link.url)

The output will be a list of internal links such as https://doc.rust-lang.org/nightly/reference.html, while links to other sites, like those to Stack Overflow, are excluded.

Be sure to check out the documentation page, as LinkExtractor has many other parameters you may need.
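
Since the question also wants to store the external links while crawling the internal ones, here is a minimal sketch of how both could be combined, building on the spider above. It uses LinkExtractor's deny_domains parameter (the counterpart of allow_domains); the 'external' field name is just illustrative:

import scrapy
from scrapy.linkextractors import LinkExtractor


class RustSpider(scrapy.Spider):
    name = "rust"
    allowed_domains = ["rust-lang.org"]  # subdomains are allowed too
    start_urls = (
        'http://www.rust-lang.org/',
    )

    def parse(self, response):
        # Off-site links: anything NOT on rust-lang.org.
        # LinkExtractor resolves relative URLs against the response,
        # so absolute links back to the same domain are not misclassified.
        external = LinkExtractor(deny_domains='rust-lang.org')
        for link in external.extract_links(response):
            yield {'external': link.url}  # store the external link

        # On-site links: follow each one and repeat.
        internal = LinkExtractor(allow_domains='rust-lang.org')
        for link in internal.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

Running it with e.g. scrapy runspider rust_spider.py -o external_links.json dumps the collected links to a JSON file. And if you would rather classify links manually, comparing urlparse(response.urljoin(link)).netloc against your own domain is a more reliable test than startswith.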

answered Sep 28 '22 by Jakub