I'm trying to extract all links from a page using Scrapy, but am struggling to use the LinkExtractor. I've tried the following:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Funda.items import FundaItem

class FundaSpider(scrapy.Spider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = [
        "http://www.funda.nl/koop/amsterdam/"
    ]
    rules = (
        Rule(LinkExtractor(), callback='parse_item')
    )

    def parse_item(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
My understanding is that including LinkExtractor() as a Rule should make the response contain only links. However, if I view the amsterdam.html file thus generated, it still seems to contain the entire web page, not just the links. How can I get the response to contain just the links?
Why would you think it would contain only links?
I think you are misunderstanding the CrawlSpider and its rules attribute. In rules you specify crawling logic rather than parsing logic: the rules tell the spider which links to follow, while parsing is handled by the function named as the callback.
So if you want to save only the links from the response, you have to extract them from the response first. You can even reuse the same LinkExtractor for that:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):
    name = 'spider1'
    allowed_domains = ['funda.nl']
    start_urls = ['http://www.funda.nl/koop/amsterdam/']
    le1 = LinkExtractor()
    rules = (
        Rule(le1, callback='parse_item'),
    )

    def parse_item(self, response):
        # this will give you Link objects
        links = self.le1.extract_links(response)
        # this will give you the html nodes of <a>
        links = response.xpath("//a").extract()
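If the end goal is a file that contains only the links from each page, the callback can write out the extracted URLs instead of the raw response body. Here is a minimal sketch of that idea, reusing the allowed_domains and start_urls from the question; the spider name and the _links.txt filename scheme are just illustrative choices:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class LinksSpider(CrawlSpider):
    # hypothetical name, just for this sketch
    name = 'links'
    allowed_domains = ['funda.nl']
    start_urls = ['http://www.funda.nl/koop/amsterdam/']
    le1 = LinkExtractor()
    rules = (
        Rule(le1, callback='parse_item'),
    )

    def parse_item(self, response):
        # extract_links() returns Link objects; keep only their URLs
        urls = [link.url for link in self.le1.extract_links(response)]
        # write one URL per line, mirroring the filename logic from the question
        filename = response.url.split("/")[-2] + '_links.txt'
        with open(filename, 'w') as f:
            f.write('\n'.join(urls))

Each followed page then produces a small text file of links rather than a copy of the full HTML.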