Inspecting response headers in scrapy without downloading the body

Tags: python, scrapy

Some PDF URLs do not end with ".pdf", so the only way to identify them is to check the response headers. I want to avoid downloading such PDFs. In Scrapy, checking the headers after the response has been fully downloaded is easy. How can I download and inspect only the response headers first, and download the body only later on?

Asked by Aayush Karki on Dec 08 '22 at 15:12


1 Answer

Use the HTTP HEAD method to fetch only the headers. Then examine Content-Type and, depending on its value, issue the same request again, this time with the GET method. See this minimal working example:

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummy'

    def start_requests(self):
        # HEAD asks the server for the headers only, without the body.
        yield scrapy.Request('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf',
                             callback=self.parse_headers, method='HEAD')

    def parse_headers(self, response):
        # Scrapy stores header values as bytes; fall back to b'' if the header is missing.
        content_type = response.headers.get('Content-Type', b'')
        if content_type.startswith(b'application/pdf'):
            # Repeat the same request with GET to download the body.
            yield response.request.replace(callback=self.parse, method='GET')

    def parse(self, response):
        print(len(response.body))
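
The Content-Type check can also be pulled out into a small helper that tolerates a missing header, extra parameters such as "; charset=binary", and differences in letter case. This is only a sketch: the name is_pdf is hypothetical and not part of Scrapy, and it assumes the value comes from response.headers.get('Content-Type'), which yields bytes or None.

```python
def is_pdf(content_type):
    """Return True if a Content-Type header value denotes a PDF.

    Accepts bytes (as Scrapy header values are stored), str, or None.
    is_pdf is a hypothetical helper name, not a Scrapy API.
    """
    if content_type is None:
        return False
    if isinstance(content_type, bytes):
        # latin-1 maps every byte to a character, so decoding cannot fail.
        content_type = content_type.decode('latin-1')
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'application/pdf'
```

Inside parse_headers this would be called as is_pdf(response.headers.get('Content-Type')). To skip PDFs rather than fetch them, as the question asks, negate the check before yielding the GET request.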

Answered by Tomáš Linhart on May 03 '23 at 13:05