
Python Scrapy - mimetype based filter to avoid non-text file downloads

I have a running Scrapy project, but it is bandwidth intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).

I think the best solution is to filter the requests based on the MIME type (the Content-Type HTTP header). I looked at the Scrapy code and found this setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'

I changed it to: DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'

Then I played a little with ScrapyHTTPPageGetter. Here are my edits, marked with a comment:

class ScrapyHTTPPageGetter(HTTPClient):
    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in
            # case I've preserved all the text
            if mimetype.find('text/') > -1:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                # Unwanted mimetype: report failure to the factory
                self.factory.noPage(Exception('Incorrect Content-Type'))

I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop a request right after determining that it has an unwanted mimetype, instead of waiting for the whole body to be downloaded.

I'm asking specifically about this part: self.factory.noPage(Exception('Incorrect Content-Type')). Is that the correct way to cancel a request?

Update 1:
My current setup has crashed the Scrapy server, so please don't use the code above to solve the problem.

Update 2:
I have set up an Apache-based website for testing, using the following structure:

/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)

I have noticed that Scrapy discards the links with the .zip extension but scrapes the one without .zip, even though it is just a symbolic link to the same file.
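The behaviour in Update 2 is what one would expect from a filter that looks only at the URL's file extension. Here is a minimal sketch of such a check, using only the standard library. The deny list and the localhost URLs are hypothetical and chosen for illustration; this is not Scrapy's actual implementation.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical deny list for illustration; not Scrapy's actual list.
DENIED_EXTENSIONS = {".zip", ".tar", ".mp3"}


def has_denied_extension(url):
    """Return True if the URL path ends with a denied file extension."""
    extension = splitext(urlparse(url).path)[1].lower()
    return extension in DENIED_EXTENSIONS


# Hypothetical test URLs mirroring the directory layout above.
print(has_denied_extension("http://localhost/scrapper-test/Zend.zip"))
print(has_denied_extension("http://localhost/scrapper-test/Zend"))
```

A check like this never sees the Content-Type, so the extensionless symlink passes even though it serves the same zip file.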

Omar Al-Ithawi asked Nov 15 '12 16:11

1 Answer

I built this middleware to exclude any response whose type isn't in a whitelist of regular expressions:

from scrapy.exceptions import IgnoreRequest
from scrapy import log
import re

class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        # True if any whitelist regex matches the Content-Type header
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes.
        """
        # each spider must define the variable response_type_whitelist as an
        # iterable of regular expressions. ex. (r'text', )
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        content_type_header = response.headers.get('content-type', None)
        if not type_whitelist:
            # no whitelist on this spider: let everything through
            return response
        elif not content_type_header:
            log.msg("no content type header: {}".format(response.url), level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
        elif self.is_valid_response(type_whitelist, content_type_header):
            log.msg("valid response {}".format(response.url), level=log.DEBUG, spider=spider)
            return response
        else:
            msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
            log.msg(msg, level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
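The whitelist check can also be tried on its own, outside Scrapy. The following is a minimal sketch, assuming Python 3, where Scrapy returns header values as bytes. The function name is_allowed_content_type and the sample whitelist are hypothetical and not part of the middleware above.

```python
import re


def is_allowed_content_type(type_whitelist, content_type_header):
    """Return True if the Content-Type header matches any whitelist regex."""
    if content_type_header is None:
        return False
    # Assumption: the header may arrive as bytes, so decode before matching
    # against str patterns.
    if isinstance(content_type_header, bytes):
        content_type_header = content_type_header.decode("latin-1")
    return any(re.search(pattern, content_type_header)
               for pattern in type_whitelist)


# Hypothetical whitelist: any text/* type plus XHTML.
whitelist = (r"^text/", r"application/xhtml\+xml")
print(is_allowed_content_type(whitelist, b"text/html; charset=utf-8"))  # True
print(is_allowed_content_type(whitelist, "application/zip"))            # False
```

Anchoring the pattern with ^ keeps a type such as application/x-text/foo from matching a bare 'text' regex.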

To use it, add it to settings.py:

    DOWNLOADER_MIDDLEWARES = {
        '[project_name].middlewares.FilterResponses': 999,
    }
saxman01 answered Dec 05 '22 22:12
