I have been working on a Scrapy web scraper that crawls through all internal links from a start URL and collects only external links. My main problem is classifying links as external or internal. For example, when I try to filter out external links with link.startswith("http") or link.startswith("ftp") or link.startswith("www"), a website that links to its own pages with an absolute path (www.my-domain.com/about instead of /about) gets classified as external even though it isn't. The following is my code:
import scrapy
from lab_relationship.items import Links

class WebSpider(scrapy.Spider):
    name = "web"
    allowed_domains = ["my-domain.com"]
    start_urls = (
        'www.my-domain.com',
    )

    def parse(self, response):
        """Finds all external links."""
        items = []
        for link in set(response.xpath('//a/@href').extract()):
            item = Links()
            if len(link) > 1:
                if link.startswith("/") or link.startswith("."):
                    # internal link
                    url = response.urljoin(link)
                    item['internal'] = url
                    # yield scrapy.Request(url, self.parse)
                elif link.startswith("http") or link.startswith("ftp") or link.startswith("www"):
                    # external link
                    item['external'] = link
                else:
                    # misc. links: mailto, id (#)
                    item['misc'] = link
                items.append(item)
        return items
Any suggestions?
Use the link extractor. When instantiating it, pass the allowed domain. You don't have to worry about specifying the tags to follow, since (according to the docs) the tags parameter takes ('a', 'area') by default.
Using the Rust language website as an example, the code to print all the internal links from their domain would look like:

import scrapy
from scrapy.linkextractors import LinkExtractor

class RustSpider(scrapy.Spider):
    name = "rust"
    allowed_domains = ["www.rust-lang.org"]
    start_urls = (
        'http://www.rust-lang.org/',
    )

    def parse(self, response):
        extractor = LinkExtractor(allow_domains='rust-lang.org')
        links = extractor.extract_links(response)
        for link in links:
            print(link.url)
and the output would be a list of links such as https://doc.rust-lang.org/nightly/reference.html, while excluding all the links to other sites like Stack Overflow.
Please be sure to check out the documentation page, as the link extractor has many parameters you may need.
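As for the classification problem itself, a more robust fix than prefix checks is to resolve every href to an absolute URL and compare its host against your own domain. Below is a minimal standalone sketch using urllib.parse; the classify helper and the "my-domain.com" value are made up for illustration and stand in for whatever you put in allowed_domains:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical helper for illustration; "my-domain.com" stands in for
# whatever you would put in allowed_domains.
OWN_DOMAIN = "my-domain.com"

def classify(href, base_url="http://www.my-domain.com/"):
    # Fragment-only links go in the misc bucket, mirroring the original
    # spider's third branch.
    if href.startswith("#"):
        return "misc"
    # Scheme-less absolute links like "www.my-domain.com/about" would
    # otherwise be treated as relative paths by urljoin.
    if href.startswith("www."):
        href = "http://" + href
    # Resolve relative links ("/about", "./contact") against the page URL,
    # which is what response.urljoin() does inside a spider.
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme == "mailto":
        return "misc"
    # Compare hosts instead of string prefixes: this correctly treats
    # absolute links to your own site as internal.
    if parsed.netloc == OWN_DOMAIN or parsed.netloc.endswith("." + OWN_DOMAIN):
        return "internal"
    return "external"

print(classify("/about"))                   # internal
print(classify("www.my-domain.com/about"))  # internal
print(classify("https://stackoverflow.com/"))  # external
```

Inside a spider you would use response.url as the base instead of a hard-coded base_url, but the host comparison is the part that fixes the misclassification you describe.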