
Using Scrapy's LinkExtractor

Tags: python, scrapy

I'm trying to extract all links from a page using Scrapy, but am struggling to use the LinkExtractor. I've tried the following:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = [
        "http://www.funda.nl/koop/amsterdam/"
    ]
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

My understanding is that including LinkExtractor() as a Rule should make the response contain only links. However, if I view the amsterdam.html file thus generated, it still seems to contain the entire web page, not just the links.

How can I get the response to contain just the links?

asked Mar 12 '23 by Kurt Peek

1 Answer

Why would you think it would contain only links?

I think you are misunderstanding the CrawlSpider and its rules attribute. Within rules you specify crawling logic rather than parsing logic: the rules tell the spider which links to follow, while parsing is handled by the function named in the callback.

So if you want to save only the links from the response, you have to extract them from the response first. You can even reuse the same LinkExtractor:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):  # rules are only applied by CrawlSpider subclasses
    name = 'spider1'
    allowed_domains = ['funda.nl']
    start_urls = ['http://www.funda.nl/koop/amsterdam/']

    le1 = LinkExtractor()
    rules = (
        Rule(le1, callback='parse_item'),  # trailing comma keeps rules a tuple
    )

    def parse_item(self, response):
        # this will give you Link objects
        links = self.le1.extract_links(response)
        # this will give you the raw html nodes of <a>
        links = response.xpath("//a").extract()
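
If the goal is to end up with a file that contains just the links instead of the whole page body, the callback can write out the extracted URLs itself. A minimal sketch of such a parse_item for the spider above (it reuses the asker's filename pattern; the .txt extension is an arbitrary choice for illustration):

    def parse_item(self, response):
        # keep only the URLs of the extracted Link objects
        links = self.le1.extract_links(response)
        filename = response.url.split("/")[-2] + '.txt'
        with open(filename, 'w') as f:
            for link in links:
                f.write(link.url + '\n')

Each Link object also carries the anchor text as link.text, in case you want to save that alongside the URL.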
answered Mar 19 '23 by Granitosaurus