How to get the Scrapy failure URLs?

I'm new to Scrapy, and it's the most amazing crawler framework I have ever used!

In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics, but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
     'downloader/request_bytes': 46282582,
     'downloader/request_count': 92383,
     'downloader/request_method_count/GET': 92383,
     'downloader/response_bytes': 123766459,
     'downloader/response_count': 92382,
     'downloader/response_status_count/200': 92382,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
     'item_scraped_count': 46191,
     'request_depth_max': 1,
     'scheduler/memory_enqueued': 92383,
     'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, to show those failed URLs. Thanks!

asked Dec 05 '12 by Joe Wu


2 Answers

Yes, this is possible.

  • The code below adds a failed_urls list to a basic spider class and appends URLs to it if the response status is 404 (this would need to be extended to cover other error statuses as required).
  • Next, I added a handler that joins the list into a single string and adds it to the spider's stats when the spider is closed.
  • Based on your comments, it's also possible to track Twisted errors (e.g. connection failures); see the errback sketch after the example output below.
  • The code has been updated to work with Scrapy 1.8. All thanks for this should go to Juliano Mendieta, since all I did was apply his suggested edits and confirm that the spider works as intended.

from scrapy import Spider, signals


class MySpider(Spider):
    # Let 404 responses reach the spider instead of being filtered out.
    handle_httpstatus_list = [404]

    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        # Store the collected URLs in the crawl stats when the spider closes.
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    def process_exception(self, request, exception, spider):
        # This follows the downloader middleware hook signature, so it is only
        # called if the class is also enabled as a downloader middleware.
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown - I simulated them by trying to run the spider after I'd turned off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
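
Note that the handle_httpstatus_list approach only catches HTTP error statuses; download-level failures such as the twisted.internet.error.ConnectionDone seen in the question's stats never produce a Response at all. Here is a minimal sketch of collecting those as well, using the errback argument that Scrapy's Request supports (the spider name, start URL, and handle_error method are illustrative assumptions, not part of the answer above):

from scrapy import Spider, Request


class FailureTrackingSpider(Spider):
    # Illustrative name and URL only.
    name = "failure_tracking"
    start_urls = ['http://www.example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # The errback fires for download errors (DNS failures, timeouts,
            # connection resets, ...) that never produce a Response object.
            yield Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        pass  # normal scraping logic goes here

    def handle_error(self, failure):
        # failure.request is the Request that could not be downloaded.
        self.crawler.stats.inc_value('failed_url_count')
        self.logger.error('Request failed: %s (%s)', failure.request.url, repr(failure.value))

The failed URLs can then be appended to a list and written into the stats on spider_closed, exactly as in the 404 example above.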
answered by Talvalin

Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):

from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    # Let 404 responses through so parse_item can record them.
    handle_httpstatus_list = [404]
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run the spider via scrapy runspider <your_spider_file.py> -o output.json and you'll see the list of collected 404 items in the output.json file.
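
If you'd rather launch the crawl from a script than from the command line, here's a rough equivalent using Scrapy's CrawlerProcess and the FEEDS setting (FEEDS requires Scrapy 2.1+); treat it as a sketch rather than part of the answer above:

from scrapy.crawler import CrawlerProcess

# Write the scraped 404 items to output.json, mirroring `-o output.json`.
process = CrawlerProcess(settings={
    "FEEDS": {"output.json": {"format": "json"}},
})
process.crawl(GithubHelpSpider)
process.start()  # blocks until the crawl is finished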

answered by alecxe