I'm a Scrapy newbie, and it's the most amazing crawler framework I have ever come across!
In my project, I sent more than 90,000 requests, but some of them failed. I set the log level to INFO, and I can only see some statistics but no details.
2012-12-05 21:03:04+0800 [pd_spider] INFO: Dumping spider stats: {'downloader/exception_count': 1, 'downloader/exception_type_count/twisted.internet.error.ConnectionDone': 1, 'downloader/request_bytes': 46282582, 'downloader/request_count': 92383, 'downloader/request_method_count/GET': 92383, 'downloader/response_bytes': 123766459, 'downloader/response_count': 92382, 'downloader/response_status_count/200': 92382, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2012, 12, 5, 13, 3, 4, 836000), 'item_scraped_count': 46191, 'request_depth_max': 1, 'scheduler/memory_enqueued': 92383, 'start_time': datetime.datetime(2012, 12, 5, 12, 23, 25, 427000)}
Is there any way to get a more detailed report? For example, to show the failed URLs. Thanks!
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
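Since every failure happens somewhere on that Request/Response round trip, one way to capture the failed URLs is to attach an errback to each Request: Scrapy calls it with a Twisted Failure, from which the original request (and, for HTTP errors, the response) can be recovered. A minimal sketch of that approach, with a placeholder spider name and example.com URLs:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class ErrbackSketchSpider(scrapy.Spider):
    name = "errback_sketch"
    start_urls = [
        'http://www.example.com/exists.html',
        'http://www.example.com/might-fail.html',
    ]

    def start_requests(self):
        for url in self.start_urls:
            # errback fires when the request fails: DNS errors, timeouts,
            # and non-2xx statuses that the HttpError middleware turns into failures.
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info('Got %s for %s', response.status, response.url)

    def on_error(self, failure):
        if failure.check(HttpError):
            # For HTTP errors, the (non-2xx) response is attached to the failure.
            url = failure.value.response.url
        else:
            # DNS/timeout/connection errors carry the original request instead.
            url = failure.request.url
        self.crawler.stats.inc_value('failed_url_count')
        self.logger.error('Failed URL: %s (%r)', url, failure.value)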
Yes, this is possible.
The example below adds a failed_urls list to a basic spider class and appends URLs to it if the response status is 404 (this would need to be extended to cover other error statuses as required).

from scrapy import Spider, signals


class MySpider(Spider):
    handle_httpstatus_list = [404]
    name = "myspider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/thisurlexists.html',
        'http://www.example.com/thisurldoesnotexist.html',
        'http://www.example.com/neitherdoesthisone.html'
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failed_urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signals.spider_closed)
        return spider

    def parse(self, response):
        if response.status == 404:
            self.crawler.stats.inc_value('failed_url_count')
            self.failed_urls.append(response.url)

    def handle_spider_closed(self, reason):
        self.crawler.stats.set_value('failed_urls', ', '.join(self.failed_urls))

    def process_exception(self, response, exception, spider):
        ex_class = "%s.%s" % (exception.__class__.__module__, exception.__class__.__name__)
        self.crawler.stats.inc_value('downloader/exception_count', spider=spider)
        self.crawler.stats.inc_value('downloader/exception_type_count/%s' % ex_class, spider=spider)
Example output (note that the downloader/exception_count* stats will only appear if exceptions are actually thrown - I simulated them by trying to run the spider after I'd turned off my wireless adapter):
2012-12-10 11:15:26+0000 [myspider] INFO: Dumping Scrapy stats: {'downloader/exception_count': 15, 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 15, 'downloader/request_bytes': 717, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'downloader/response_bytes': 15209, 'downloader/response_count': 3, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/404': 2, 'failed_url_count': 2, 'failed_urls': 'http://www.example.com/thisurldoesnotexist.html, http://www.example.com/neitherdoesthisone.html', 'finish_reason': 'finished', 'finish_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 874000), 'log_count/DEBUG': 9, 'log_count/ERROR': 2, 'log_count/INFO': 4, 'response_received_count': 3, 'scheduler/dequeued': 3, 'scheduler/dequeued/memory': 3, 'scheduler/enqueued': 3, 'scheduler/enqueued/memory': 3, 'spider_exceptions/NameError': 2, 'start_time': datetime.datetime(2012, 12, 10, 11, 15, 26, 560000)}
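One caveat about the spider above: process_exception() is a hook that Scrapy only calls on downloader middlewares, so defining it on the spider itself will not catch download errors. A minimal sketch of moving that method into a middleware (the class name and the myproject.middlewares settings path are hypothetical):

# settings.py (hypothetical module path):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.FailedRequestStatsMiddleware': 543,
# }

class FailedRequestStatsMiddleware:

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Give the middleware access to the shared stats collector.
        return cls(crawler.stats)

    def process_exception(self, request, exception, spider):
        # Called when downloading a request raises an exception
        # (DNS lookup errors, timeouts, connection resets, ...).
        ex_class = "%s.%s" % (exception.__class__.__module__,
                              exception.__class__.__name__)
        self.stats.inc_value('downloader/exception_count', spider=spider)
        self.stats.inc_value('downloader/exception_type_count/%s' % ex_class,
                             spider=spider)
        # 'failed_url_count' mirrors the custom key used in the spider above.
        self.stats.inc_value('failed_url_count', spider=spider)
        # Returning None lets Scrapy's default exception handling continue.
        return None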
Here's another example of how to handle and collect 404 errors (checking the GitHub help pages):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field


class GitHubLinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class GithubHelpSpider(CrawlSpider):
    name = "github_help"
    allowed_domains = ["help.github.com"]
    start_urls = ["https://help.github.com"]
    handle_httpstatus_list = [404]
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            item = GitHubLinkItem()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['status'] = response.status
            return item
Just run scrapy runspider with -o output.json and see the list of items in the output.json file.
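If you would rather post-process that export than read it by eye, a small script can pull out just the 404 entries. This sketch assumes the feed was written to output.json as above and that the items have the url, referer and status fields of GitHubLinkItem:

import json

# Load the exported items and print the URLs that came back as 404,
# together with the page that linked to them.
with open('output.json') as f:
    items = json.load(f)

for item in items:
    if item.get('status') == 404:
        print(item['url'], 'linked from', item.get('referer'))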