Scrapy and response status code: how to check against it?

Tags: python, http-status-codes, scrapy

I'm using Scrapy to crawl my sitemap and check for 404, 302 and 200 pages, but I can't seem to get the response code. This is my code so far:

from scrapy.contrib.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## stuff we need for tothego ##
    sitemap_urls = []
    ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print("USAGE: scrapy crawl [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n")
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception as e:
            self.log('[exception] : %s' % e)

    def append(self, path, line):
        ## append one line to the given log file
        with open(path, 'a') as f:
            f.write(line + "\n")

According to Scrapy's docs, response.status is an integer corresponding to the status code of the response. So far, only the URLs with status 200 are logged; the 302s are not written to the output file (although I can see the redirects in crawl.log). So, what do I have to do to "trap" the 302 responses and save those URLs?


asked Mar 14 '12 08:03

Samuele Mattiuzzo

2 Answers

http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Assuming default spider middleware is enabled, response codes outside of the 200-300 range are filtered out by HttpErrorMiddleware. You can tell the middleware you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
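
To make the filtering rule described above concrete, here is a simplified model of it in plain Python. It is not Scrapy's source code, only a sketch of the decision as this answer describes it; the helper name reaches_spider is hypothetical.

```python
def reaches_spider(status, handle_httpstatus_list=()):
    """Simplified model: would a response with this status reach the callback?

    Successful (2xx) responses always pass; any other status passes only
    if the spider explicitly lists it in handle_httpstatus_list.
    """
    if 200 <= status < 300:
        return True
    return status in handle_httpstatus_list


# With handle_httpstatus_list = [404], 404 responses reach parse(), 500s do not.
for status in (200, 404, 500):
    print(status, reaches_spider(status, handle_httpstatus_list=[404]))
```

Under this model, the spider in the question only ever sees 200 responses, because its list of extra statuses is empty.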

answered Oct 17 '22 13:10

njbooher

Just to complete the answer here:

Set handle_httpstatus_list = [302] on the spider.
On the request, set dont_redirect to True in meta.

For example: Request(URL, meta={'dont_redirect': True})

answered Oct 17 '22 13:10

Ricardo Lucca
