I have a running Scrapy project, but it uses a lot of bandwidth because it tries to download many binary files (zip, tar, mp3, etc.).
I think the best solution is to filter requests based on the MIME type (the Content-Type HTTP header). I looked at the Scrapy code and found this setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
I changed it to:
DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'
I then experimented a little with ScrapyHTTPPageGetter. Here are my edits:
class ScrapyHTTPPageGetter(HTTPClient):
    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in
            # case I've preserved all the text
            if mimetype.find('text/') > -1:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))
I feel this is wrong. I need a more Scrapy-friendly way to cancel or drop a request as soon as I determine that it has an unwanted MIME type, instead of waiting for the whole body to be downloaded.
Edit:
I'm asking specifically about this part: self.factory.noPage(Exception('Incorrect Content-Type'))
Is that the correct way to cancel a request?
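Whatever mechanism ends up cancelling the request, the Content-Type check itself can be made stricter than a substring search. The following is only a sketch: the function name `is_wanted_mimetype` and the `wanted_prefixes` parameter are made up for illustration and are not part of Scrapy. Newer Scrapy releases also document a `headers_received` signal whose handler may raise `StopDownload` to abort the body download; whether that is available depends on the installed version, so treat it as an assumption to verify rather than as the answer for the version used here.

```python
def is_wanted_mimetype(content_type, wanted_prefixes=("text/",)):
    """Return True if a Content-Type header value names a wanted media type.

    Hypothetical helper: neither the name nor the parameters come from Scrapy.
    """
    if content_type is None:
        # No header at all: treat the response as unwanted.
        return False
    if isinstance(content_type, bytes):
        # Header values often arrive as bytes; latin-1 never fails to decode.
        content_type = content_type.decode("latin-1")
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith(tuple(wanted_prefixes))
```

Comparing only the media type, with parameters stripped and case normalised, avoids accidental matches elsewhere in the header value.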
Update 1:
My current setup has crashed the Scrapy server, so please don't use the code above to solve the problem.
Update 2:
I have set up an Apache-based website for testing, using the following structure:
/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip
I have noticed that Scrapy discards the URL with the .zip extension but scrapes the one without it, even though it is just a symbolic link to the same file.
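One likely explanation, assuming the links are collected with a Scrapy link extractor: by default the link extractor skips URLs whose path ends in one of a list of ignored extensions, and it decides this from the URL text alone, before any request is sent. A symlink without an extension therefore passes the check. The sketch below imitates that kind of check with the standard library only; the `IGNORED` set and the example host are illustrative assumptions, not Scrapy's actual list.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical subset of ignored extensions, for illustration only.
IGNORED = {".zip", ".tar", ".mp3"}


def has_ignored_extension(url, ignored=IGNORED):
    """Return True if the URL path ends with one of the ignored extensions."""
    path = urlparse(url).path
    # splitext returns ("", "") style pairs; the second item is the extension.
    extension = posixpath.splitext(path)[1].lower()
    return extension in ignored
```

With this check, a URL ending in /scrapper-test/Zend.zip is filtered out while /scrapper-test/Zend is not, which matches the behaviour described above and is why a Content-Type check is still needed.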
I built this middleware to exclude any response whose type isn't in a whitelist of regular expressions:
from scrapy.http.response.html import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy import log
import re


class FilterResponses(object):
    """Limit the HTTP response types that Scrapy downloads."""

    @staticmethod
    def is_valid_response(type_whitelist, content_type_header):
        for type_regex in type_whitelist:
            if re.search(type_regex, content_type_header):
                return True
        return False

    def process_response(self, request, response, spider):
        """
        Only allow HTTP response types that match the given list of
        filtering regexes
        """
        # each spider must define the variable response_type_whitelist as an
        # iterable of regular expressions. ex. (r'text', )
        type_whitelist = getattr(spider, "response_type_whitelist", None)
        content_type_header = response.headers.get('content-type', None)
        if not type_whitelist:
            return response
        elif not content_type_header:
            log.msg("no content type header: {}".format(response.url), level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
        elif self.is_valid_response(type_whitelist, content_type_header):
            log.msg("valid response {}".format(response.url), level=log.DEBUG, spider=spider)
            return response
        else:
            msg = "Ignoring request {}, content-type was not in whitelist".format(response.url)
            log.msg(msg, level=log.DEBUG, spider=spider)
            raise IgnoreRequest()
To use it, add it to settings.py:
DOWNLOADER_MIDDLEWARES = {
    '[project_name].middlewares.FilterResponses': 999,
}
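One caveat if the middleware is run on Python 3: Scrapy returns header values as bytes there, and re.search raises a TypeError when a str pattern is applied to a bytes value. A minimal sketch of a whitelist check that tolerates both types is shown here; the function name `matches_whitelist` is hypothetical and simply stands in for the `is_valid_response` static method.

```python
import re


def matches_whitelist(type_whitelist, content_type_header):
    """Return True if any whitelist regex matches the Content-Type value.

    Hypothetical stand-in for is_valid_response that accepts str or bytes.
    """
    if content_type_header is None:
        return False
    if isinstance(content_type_header, bytes):
        # Decode so that str patterns such as r'text' can be applied.
        content_type_header = content_type_header.decode("latin-1")
    return any(re.search(pattern, content_type_header)
               for pattern in type_whitelist)
```

A spider would still declare its patterns as before, for example response_type_whitelist = (r'text', ).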