 

Scrapy and response status code: how to check against it?

I'm using Scrapy to crawl my sitemap, to check for 404, 302 and 200 pages. But I can't seem to get the response code. This is my code so far:

from scrapy.contrib.spiders import SitemapSpider


class TothegoSitemapHomesSpider(SitemapSpider):
    name = 'tothego_homes_spider'

    ## stuff we need for tothego ##
    sitemap_urls = []
    ok_log_file = '/opt/Workspace/myapp/crawler/valid_output/ok_homes'
    bad_log_file = '/opt/Workspace/myapp/crawler/bad_homes'
    fourohfour = '/opt/Workspace/myapp/crawler/404/404_homes'

    def __init__(self, **kwargs):
        SitemapSpider.__init__(self)

        if len(kwargs) > 1:
            if 'domain' in kwargs:
                self.sitemap_urls = ['http://url_to_sitemap%s/sitemap.xml' % kwargs['domain']]

            if 'country' in kwargs:
                self.ok_log_file += "_%s.txt" % kwargs['country']
                self.bad_log_file += "_%s.txt" % kwargs['country']
                self.fourohfour += "_%s.txt" % kwargs['country']

        else:
            print "USAGE: scrapy [crawler_name] -a country=[country] -a domain=[domain] \nWith [crawler_name]:\n- tothego_homes_spider\n- tothego_cars_spider\n- tothego_jobs_spider\n"
            exit(1)

    def parse(self, response):
        try:
            if response.status == 404:
                ## 404s are also tracked separately
                self.append(self.bad_log_file, response.url)
                self.append(self.fourohfour, response.url)

            elif response.status == 200:
                ## write to ok_log_file
                self.append(self.ok_log_file, response.url)
            else:
                self.append(self.bad_log_file, response.url)

        except Exception, e:
            self.log('[exception] : %s' % e)

    def append(self, path, string):
        log = open(path, 'a')
        log.write(string + "\n")
        log.close()

According to Scrapy's docs, response.status is an integer corresponding to the status code of the response. So far it logs only the 200-status URLs, while the 302s aren't written to the output file (though I can see the redirects in crawl.log). So, what do I have to do to "trap" the 302 requests and save those URLs?

asked Mar 14 '12 by Samuele Mattiuzzo



2 Answers

http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html#module-scrapy.contrib.spidermiddleware.httperror

Assuming the default spider middleware is enabled, response codes outside the 200-300 range are filtered out by HttpErrorMiddleware. You can tell the middleware that you want to handle 404s by setting the handle_httpstatus_list attribute on your spider.

class TothegoSitemapHomesSpider(SitemapSpider):
    handle_httpstatus_list = [404]
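
As a minimal, self-contained sketch (the spider name and sitemap URL here are hypothetical, not from the question): with the attribute set, 404 responses are no longer filtered out and reach parse() alongside the 200s:

from scrapy.contrib.spiders import SitemapSpider


class SitemapStatusSpider(SitemapSpider):
    name = 'sitemap_status_spider'
    sitemap_urls = ['http://example.com/sitemap.xml']  # hypothetical URL
    # Without this, HttpErrorMiddleware silently drops 404 responses
    # before they ever reach parse().
    handle_httpstatus_list = [404]

    def parse(self, response):
        # Log the status code of every URL found in the sitemap.
        self.log('%d %s' % (response.status, response.url))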
answered Oct 17 '22 by njbooher


Just to have a complete answer here:

  • Set handle_httpstatus_list = [302];

  • On the request, set dont_redirect to True in its meta.

For example: Request(url, meta={'dont_redirect': True})
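
A minimal sketch of that combination (spider name and URL are hypothetical): handle_httpstatus_list lets the 302 response through the HttpError middleware, and dont_redirect stops the Redirect middleware from following it, so parse() receives the raw 302 itself:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class RedirectLoggerSpider(BaseSpider):
    name = 'redirect_logger_spider'
    # Let 302 responses through HttpErrorMiddleware.
    handle_httpstatus_list = [302]

    def start_requests(self):
        # dont_redirect stops RedirectMiddleware from following the
        # redirect, so the 302 response is passed to parse() as-is.
        yield Request('http://example.com/old-page',
                      meta={'dont_redirect': True})

    def parse(self, response):
        if response.status == 302:
            self.log('302 %s -> %s' % (
                response.url, response.headers.get('Location')))

The same two settings apply to a SitemapSpider; the meta just has to end up on whatever requests the spider issues.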

answered Oct 17 '22 by Ricardo Lucca