How to get the Scrapy failure URLs?

I'm new to Scrapy, and it's the most amazing crawler framework I have ever used!

In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics, but no details.

2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1,
     'downloader/request_bytes': 46282582,
     'downloader/request_count': 92383,
     'downloader/request_method_count/GET': 92383,
     'downloader/response_bytes': 123766459,
     'downloader/response_count': 92382,
     'downloader/response_status_count/200': 92382,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000),
     'item_scraped_count': 46191,
     'request_depth_max': 1,
     'scheduler/memory_enqueued': 92383,
     'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}

Is there any way to get a more detailed report? For example, to show those failed URLs. Thanks!

asked Dec 05 '12 by Joe Wu


2 Answers

Yes, this is possible.

  • The code below adds a failed_urls list to a basic spider class and appends URLs to it if the response status is 404 (this would need to be extended to cover other error statuses as required).
  • Next, I added a handler that joins the list into a single string and adds it to the spider's stats when the spider is closed.
  • Based on your comments, it's also possible to track Twisted errors (e.g. connection failures); see the errback sketch after the example output below.
  • The code has been updated to work with Scrapy 1.8. All thanks for this should go to Juliano Mendieta, since all I did was apply his suggested edits and confirm that the spider works as intended.

from scrapy import Spider, signals


class MySpider(Spider):
    # Let 404 responses reach the spider instead of being filtered out.
    handle_httpstatus_list = [404]

    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        # Store the collected URLs in the crawl stats when the spider closes.
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    def process_exception(self, request, exception, spider):
        # This follows the downloader middleware hook signature, so it is only
        # called if the class is also enabled as a downloader middleware.
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)

Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown - I simulated them by trying to run the spider after I'd turned off my wireless adapter):

2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 15,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15,
     'downloader/request_bytes': 717,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 15209,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 2,
     'failed_url_count': 2,
     'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html',
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000),
     'log_count/DEBUG': 9,
     'log_count/ERROR': 2,
     'log_count/INFO': 4,
     'response_received_count': 3,
     'scheduler/dequeued': 3,
     'scheduler/dequeued/memory': 3,
     'scheduler/enqueued': 3,
     'scheduler/enqueued/memory': 3,
     'spider_exceptions/NameError': 2,
     'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
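
Note that the handle_httpstatus_list approach only catches HTTP error statuses; download-level failures such as the twisted.internet.error.ConnectionDone seen in the question's stats never produce a Response at all. Here is a minimal sketch of collecting those as well, using the errback argument that Scrapy's Request supports (the spider name, start URL, and handle_error method are illustrative assumptions, not part of the answer above):

from scrapy import Spider, Request


class FailureTrackingSpider(Spider):
    # Illustrative name and URL only.
    name = "failure_tracking"
    start_urls = ['http://www.example.com/']

    def start_requests(self):
        for url in self.start_urls:
            # The errback fires for download errors (DNS failures, timeouts,
            # connection resets, ...) that never produce a Response object.
            yield Request(url, callback=self.parse, errback=self.handle_error)

    def parse(self, response):
        pass  # normal scraping logic goes here

    def handle_error(self, failure):
        # failure.request is the Request that could not be downloaded.
        self.crawler.stats.inc_value('failed_url_count')
        self.logger.error('Request failed: %s (%s)', failure.request.url, repr(failure.value))

The failed URLs can then be appended to a list and written into the stats on spider_closed, exactly as in the 404 example above.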
answered by Talvalin

Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):

from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com", ]
    # Let 404 responses through so parse_item can record them.
    handle_httpstatus_list = [404]
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status

            return item

Just run the spider via scrapy runspider <your_spider_file.py> -o output.json and you'll see the list of collected 404 items in the output.json file.
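
If you'd rather launch the crawl from a script than from the command line, here's a rough equivalent using Scrapy's CrawlerProcess and the FEEDS setting (FEEDS requires Scrapy 2.1+); treat it as a sketch rather than part of the answer above:

from scrapy.crawler import CrawlerProcess

# Write the scraped 404 items to output.json, mirroring `-o output.json`.
process = CrawlerProcess(settings={
    "FEEDS": {"output.json": {"format": "json"}},
})
process.crawl(GithubHelpSpider)
process.start()  # blocks until the crawl is finished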

answered by alecxe