
Using Scrapy's LinkExtractor

Tags: python, scrapy

I'm trying to extract all links from a page using Scrapy, but am struggling to use the LinkExtractor. I've tried the following:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem

class FundaSpider(CrawlSpider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = [
        "http://www.funda.nl/koop/amsterdam/"
    ]
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

My understanding is that including LinkExtractor() as a Rule should make the response contain only links. However, if I view the amsterdam.html file thus generated, it still seems to contain the entire web page, not just the links.

How can I get the response to contain just the links?

asked Mar 12 '23 by Kurt Peek

1 Answer

Why would you think it would contain only links?

I think you are misunderstanding the CrawlSpider and its rules attribute. Within rules you specify crawling logic rather than parsing logic: the rules tell the spider which links to follow, while parsing is handled by the function named in the callback.

So if you want to save only the links from the response, you have to extract them from the response first. You can even reuse the same LinkExtractor:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):  # rules are only applied by CrawlSpider subclasses
    name = 'spider1'
    allowed_domains = ['funda.nl']
    start_urls = ['http://www.funda.nl/koop/amsterdam/']

    le1 = LinkExtractor()
    rules = (
        Rule(le1, callback='parse_item'),  # trailing comma keeps rules a tuple
    )

    def parse_item(self, response):
        # this will give you Link objects
        links = self.le1.extract_links(response)
        # this will give you the raw html nodes of <a>
        links = response.xpath("//a").extract()
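
If the goal is to end up with a file that contains just the links instead of the whole page body, the callback can write out the extracted URLs itself. A minimal sketch of such a parse_item for the spider above (it reuses the asker's filename pattern; the .txt extension is an arbitrary choice for illustration):

    def parse_item(self, response):
        # keep only the URLs of the extracted Link objects
        links = self.le1.extract_links(response)
        filename = response.url.split("/")[-2] + '.txt'
        with open(filename, 'w') as f:
            for link in links:
                f.write(link.url + '\n')

Each Link object also carries the anchor text as link.text, in case you want to save that alongside the URL.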
answered Mar 19 '23 by Granitosaurus