Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scrapy get website with error "DNS lookup failed"

I'm trying to use Scrapy to get all links on websites where the "DNS lookup failed".

The problem is, every website without any errors are print on the parse_obj method but when an url return DNS lookup failed, the callback parse_obj is not call.

I want to get all domain with the error "DNS lookup failed", how can I do that ?

Logs :

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET http://domain.com> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET http://expired-domain.com/> (failed 1 times): DNS lookup failed: address 'expired-domain.com' not found: [Errno 11001] getaddrinfo failed.

Code :

class MyItem(Item):
    url= Field()

class someSpider(CrawlSpider):
    name = 'Crawler'        
    start_urls = ['http://domain.com']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            parsed_uri = urlparse(link.url)
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            print url
like image 995
Pixel Avatar asked Mar 14 '23 00:03

Pixel


1 Answers

CrawlSpider Rules do not allow passing errbacks (that's a shame)

Here's a variation of another answer I gave for catching DNS errors:

# -*- coding: utf-8 -*-
import random

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class HttpbinSpider(CrawlSpider):
    name = "httpbin"

    # this will generate test links so that we can see CrawlSpider in action
    start_urls = (
        'https://httpbin.org/links/10/0',
    )
    rules = (
        Rule(LinkExtractor(),
             callback='parse_page',
             # hook to be called when this Rule generates a Request
             process_request='add_errback'),
    )

    # this is just to no retry errors for this example spider
    custom_settings = {
        'RETRY_ENABLED': False
    }

    # method to be called for each Request generated by the Rules above,
    # here, adding an errback to catch all sorts of errors
    def add_errback(self, request):
        self.logger.debug("add_errback: patching %r" % request)

        # this is a hack to trigger a DNS error randomly
        rn = random.randint(0, 2)
        if rn == 1:
            newurl = request.url.replace('httpbin.org', 'httpbin.organisation')
            self.logger.debug("add_errback: patching url to %s" % newurl)
            return request.replace(url=newurl,
                                   errback=self.errback_httpbin)

        # this is the general case: adding errback to all requests
        return request.replace(errback=self.errback_httpbin)

    def parse_page(self, response):
        self.logger.info("parse_page: %r" % response)

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

This is what you get on the console:

$ scrapy crawl httpbin
2016-03-08 15:16:30 [scrapy] INFO: Scrapy 1.0.5 started (bot: httpbinlinks)
2016-03-08 15:16:30 [scrapy] INFO: Optional features available: ssl, http11
2016-03-08 15:16:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'httpbinlinks.spiders', 'SPIDER_MODULES': ['httpbinlinks.spiders'], 'BOT_NAME': 'httpbinlinks'}
2016-03-08 15:16:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-08 15:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-08 15:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-08 15:16:30 [scrapy] INFO: Enabled item pipelines: 
2016-03-08 15:16:30 [scrapy] INFO: Spider opened
2016-03-08 15:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 15:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 15:16:30 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/0> (referer: None)
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/5>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/9>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/8> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/7> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/6> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/3> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/4> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/1> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/2> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [scrapy] INFO: Closing spider (finished)
2016-03-08 15:16:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/request_bytes': 2577,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 3968,
 'downloader/response_count': 8,
 'downloader/response_status_count/200': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 8, 14, 16, 31, 761515),
 'log_count/DEBUG': 20,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 1,
 'response_received_count': 8,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2016, 3, 8, 14, 16, 30, 427657)}
2016-03-08 15:16:31 [scrapy] INFO: Spider closed (finished)
like image 198
paul trmbrth Avatar answered Mar 23 '23 08:03

paul trmbrth