Some PDF URLs do not end with ".pdf", so the only way to tell is to check the response headers. I want to avoid downloading such PDFs. In Scrapy, checking the headers after the response has been fully downloaded is easy, but how do I download and inspect only the response headers first and fetch the body later?
Use the HTTP HEAD method to fetch just the headers. Then examine Content-Type and, based on that, issue the same request again with the GET method. See this minimal working example:
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummy'

    def start_requests(self):
        # Ask for the headers only; no body is transferred for a HEAD request.
        yield scrapy.Request(
            'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf',
            callback=self.parse_headers,
            method='HEAD',
        )

    def parse_headers(self, response):
        # Header values are bytes in Scrapy, so compare against a bytes literal.
        if response.headers['Content-Type'].startswith(b'application/pdf'):
            # Re-issue the same request, this time with GET, to download the body.
            yield response.request.replace(callback=self.parse, method='GET')

    def parse(self, response):
        print(len(response.body))
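Since the goal in the question is to skip PDFs rather than download them, the same HEAD-then-GET pattern works with the condition inverted: if the Content-Type from the HEAD response indicates a PDF, simply stop; otherwise repeat the request with GET. The following is a minimal sketch under that assumption; the spider name skip_pdf, the callback name check_headers, and the start URL are placeholders for illustration.

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import scrapy


class SkipPdfSpider(scrapy.Spider):
    # Hypothetical spider; name and start URL are placeholders.
    name = 'skip_pdf'
    start_urls = ['https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf']

    def start_requests(self):
        for url in self.start_urls:
            # HEAD first, so only the headers are transferred.
            yield scrapy.Request(url, callback=self.check_headers, method='HEAD')

    def check_headers(self, response):
        # Header values are bytes; fall back to b'' if the header is missing.
        content_type = response.headers.get('Content-Type') or b''
        if content_type.startswith(b'application/pdf'):
            # It is a PDF: stop here, so the body is never downloaded.
            self.logger.info('Skipping PDF at %s', response.url)
            return
        # Not a PDF: repeat the same request with GET to fetch the body.
        yield response.request.replace(callback=self.parse, method='GET')

    def parse(self, response):
        self.logger.info('Downloaded %d bytes from %s', len(response.body), response.url)

The follow-up GET is not dropped by the duplicate filter because Scrapy's default request fingerprint includes the HTTP method, so the HEAD and GET requests to the same URL count as different requests.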